bensyverson's comments | Hacker News

The real downside to Google's solution is that you have to use Google Meet. Depending on your opinion of Meet, this is either no big deal or a total deal breaker.

I just updated to 3.5.x to get pq support. Anything that might tempt me to upgrade to 4.0?

The top feature, “Support for Encrypted Client Hello (ECH, RFC 9849)”, is of prime importance to those operating Internet-accessible servers, or clients; hopefully your Postgres server is not one such!

It's a web server (pg / post-quantum, not pg / Postgres), but that's a great feature!

gqgq

Er, your first acronym is pg, not pq. (I had to do a font test above to be sure!) But point taken! You might care then; I saw various elliptic-curve changes and I assume it’s got pq advancements in there somewhere.


The argument against seat pricing is that companies will employ fewer people in general. So a 45-person company paying for 10 seats will become a 7-person company paying for one seat.

Not sure I agree with this train of thought, but a SaaS CEO made this exact argument to me last week.


I wonder if the PR workflow is just unsustainable in the agentic era. Rather than review every new feature or bug fix, we would depend on good test coverage, and hold developers accountable for what they ship.

The result might be more faulty code getting merged, but if you already have outages and can't review every PR, is there currently a meaningful benefit to the PR workflow?


This is the "if you're already letting faults through, why not give up trying to stop faults?" approach.

The alternative might be "what if we could get the genie back into the bottle?"

We know some people are using LLMs to evaluate PRs; the only question is who, and how strong the incentive is for them to give up.


Diogenes carrying a lamp, looking for good test coverage

Copy-pasting screenshots of red lines.

> I wonder if the PR workflow is just unsustainable in the agentic era. Rather than review every new feature or bug fix, we would depend on good test coverage, and hold developers accountable for what they ship.

I think what you're describing is setting up the human as the fall guy for the machine.


So taking responsibility for the code you generate is being a "fall guy?"

> So taking responsibility for the code you generate is being a "fall guy?"

Yes, if your boss expects you to use AI agents to generate code faster than you can reasonably understand and review it. You're stuck between a rock and a hard place: you're "responsible," but if you take the time to actually be responsible you'll be reprimanded. The environment pushes you to slack on reviews in the short term to keep your head above water, but when a problem happens because of that you'll be blamed for it.


This reminds me a bit of monoliths vs. microservices. People would see microservices as the next shiny new thing and bring them along to their next job, or read a blog post that sounded great in theory but fell apart in practice. People would see it as a purely architectural decision. But the reality was that you had to have the organizational structure to support that development model, or you'd find out that it just doesn't scale the way you expect and introduces its own set of problems. My experience is that most teams without large orgs got bogged down by the weight of microservices (or things called "microservices"). It required a lot of tooling and orchestration to manage. But there was this promise that you could easily rewrite a microservice from scratch, or change languages, and nobody would notice or care.

LLM-generated code feels the same. Reviewing LLM-generated code in the context of a monolith is more taxing than reviewing it in the context of a microservice; the blast radius is larger and the risk is greater. With microservices you can make decisions about how important a service actually is for system-wide stability; you can effectively not care about some services, and go back and iterate on or rewrite them several times over. But more importantly, the organizational structures needed to support microservice-like architectures effectively also feel like the organizational structures needed to support LLM-generated codebases effectively: more siloing, more ownership, more contract- and spec-based communication between teams, etc. Teams might become one person and an agent in that org structure. But the communication and responsibilities feel like they require something similar to what's needed to support microservices... just that the services are probably closer in size to what many companies end up building when they try to build microservices.

And then there are majestic monoliths: very well curated monoliths that feel like a monorepo of services, with clear design and architecture. If they've been well managed, these are also likely to work well for agents, but they still suffer the same cognitive overhead when reviewing the agents' work, because organizationally the people working on or reviewing code for these projects are often still responsible for more than a narrow slice, with a lot of overlap with other devs, requiring more eyes and buy-in for each change as a result.

The organizational structures we have in place today might be forced to adapt over time, to silo in ways where ownership and responsibility narrow to fit what we can juggle mentally. Or they'll be forced to slow down and accept the limitations of the organizational structure. Personal projects are where people have had a lot of success with LLMs, which feels closer to small siloed teams. Open-source collaboration with LLM PRs feels like it falls apart for the same cognitive-overhead reasons as existing team structures that adopt AI.


My opinion on "don't re-invent the wheel" has really shifted with these supply chain attacks and the ease of rolling your own with AI.

I agree that I wouldn't roll my own crypto, but virtually anything else? I'm pretty open.


A reply which references neither the parent comment nor the article, but makes a strong and likely negative statement.

This article makes the case for paraxanthine supplements; 80% of caffeine is metabolized into paraxanthine anyway, and it turns out paraxanthine behaves a bit more like the way we (apparently wrongly) assume caffeine works.

But the real question is: does it taste as good as espresso?


> does it taste as good as espresso?

Coffee is an acquired taste, I think. People condition themselves to like the bitter taste of coffee over time. I remember hating the taste of coffee (or beer, for example) in childhood.


Weirdly enough, I loved coffee from the first time I tried it, at maybe 13. Even though, looking back, it must have been terrible coffee; it was at some vaguely Model UN-like thing our entire class went to on an overnight trip. Obviously not enough sleep was had. A vending machine (in the late 90s) provided coffee...

Yes, I also tried coffee for the first time in England when I was 13, and it was like a revelation. I understand that beer and cigarettes are an acquired taste; they tasted terrible. But coffee was love at first sip.

> it must have been terrible coffee

Douglas Adams nailed the quality of tea from a vending machine, "almost, but not quite, entirely unlike tea", and coffee machines of that era weren't much better at coffee.


I’ve heard that bitterness affects children more intensely. So I wonder how much of it is an acquired taste vs bitterness just becoming “milder” over time.

My three-year-old loves the taste of matcha, even when I don't prepare it quite right and it turns out very bitter. He's pretty picky about nearly everything else. I think it's acquisition through mimicry.

Matcha is one of the more concentrated amino acid drinks you can make; given how hungry I remember being as a kid, I bet it tastes like liquid gold. And if you’re in a climate that tolerates rhododendrons, you can plant a Camellia sinensis bush and try it straight off the plant, as a bridge from matcha to steeped tea, steaming and roasting, etc.

> Matcha is one of the more concentrated amino acid drinks

Matcha is virtually entirely water. Multiple sources say that matcha has about 270 mg of amino acids per serving. Even if matcha powder were 100% amino acids (which would taste vile), a 2 g serving would still only deliver 2 g.

Milk has about 4.5 grams of amino acid content per 100g (less than half a cup).
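
Back-of-the-envelope, just to make the per-serving comparison concrete (the 270 mg and 4.5 g figures are the ones above; the serving sizes are my assumptions):

    # Rough per-serving amino acid comparison, using the figures quoted above.
    # Assumed serving sizes: ~2 g matcha powder per bowl, ~240 g milk per cup.
    matcha_amino_mg = 270                      # per serving, per the sources above
    milk_amino_g_per_100g = 4.5
    milk_serving_g = 240

    milk_amino_mg = milk_amino_g_per_100g / 100 * milk_serving_g * 1000
    print(matcha_amino_mg)                     # 270
    print(milk_amino_mg)                       # 10800.0, i.e. roughly 40x more

So even a modest glass of milk dwarfs a bowl of matcha on this axis.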


Yes, that’s one of the reasons dairy matcha lattes taste so good: not only is it more densely amino, but it’s also more broadly amino (e.g. milk is not particularly high in L-theanine), and the sweetness of the milk offsets the bitterness of the matcha, which lets you ramp up the density further beyond 2g if you like.

What I mean to say is that matcha is almost devoid of amino acid content. It’s basically a small cup of water. The small amounts of various compounds may have some beneficial effects, but amino acids are abundant in many foods and drinks. You don’t need to get them in micro doses from matcha.

Matcha may be tasty. It’s not a good source of aminos.


Okay.

It's definitely an acquired taste. But an espresso doesn't have to be overwhelmingly bitter! It can be almost sweet if it's extracted well.

Regular coffee, too, can be very delicate, minimally bitter, giving herbal or strong tea-like notes, among other things.

I've not gotten that kind of profile out of anything but fairly-expensive beans roasted within the last couple of weeks, though. I've never seen it out of even mid-priced beans, nor anything nationally distributed. It's practically a totally different drink from what you get if you ask for a coffee in most contexts.

Iced coffee and cold brew are also fairly different. I find middling beans can make a much milder and more pleasant cold brew coffee than hot. Tiny (like, a teaspoon) splash of cream or milk and it takes the bitter edge all but completely off, to my taste anyway.


> I've not gotten that kind of profile out of anything but fairly-expensive beans roasted within the last couple of weeks, though.

Good beans will last more than 2 weeks, but yes—just as you wouldn't judge all sushi based on gas station sushi, we shouldn't judge coffee based on months-old pre-ground grocery store roasts.


> we shouldn't judge coffee based on months-old pre-ground grocery store roasts.

Next you'll be claiming that we shouldn't judge sushi based on months-old grocery store sushi.


> Coffee is an acquired taste, I think

Billions all over the world managed to acquire it just fine.

If that counts as an acquired taste, I doubt 99% of the drinks that aren't acquired tastes do much better, assuming anything does better than coffee to begin with.

Not even Cola and tea come close.


I now like bitters and soda, and I didn’t like bitter as a kid, so I think there might also be shifts in favor of bitter unrelated to coffee. Perhaps the same thing that leads people to appreciate spicy or sour as experiences broaden.

Years ago I read somewhere that children have a genetically based urge to avoid bitter flavors, since bitterness may signal natural poisons, whereas adults can judge better, so the urge is lessened.

(And even if that source were true, that wouldn't make the genetic effect an absolute; it would depend on individual genetics and the variable expression of those genes. And probably on the individual's experience, either as a child or as an adult.)


But why? I have to add so much milk and sugar to mask the bitterness that, combined with the negative effects, I asked myself: why do I even bother? I might as well just drink hot milk with sugar instead. Now I only drink coffee if I need the energy and waking effects and nothing else sugary is available, which happens once a year, at most.

This is not universal. I only drink espresso without sugar or milk, because I love the taste of a strong coffee.

That is entirely dependent on your diet as a child. I know children that love bitter or sour/fermented foods. Not to mention they dislike things that are overly sweet.

I wouldn't be surprised if all tastes are essentially "acquired".


Speak for yourself. The bitter taste is what I like. When I don't like a cup of coffee, it's always because it's too sour. (Which can be masked pretty well with milk or a substitute, mind.)

I agree with your main point, though. I hated coffee most of my life. Even the smell made me feel ill. At some point, I flipped. I've always liked tea, fwiw.

I guess I don't hate beer as much as I used to. Still don't like it, though. Maybe another few decades?


> But the real question is: does it taste as good as espresso?

I don’t know where you live, but in Italy it’s extremely difficult to find a good espresso; you must go to "specialty coffee" places to taste real coffee, as all the bars use cheap coffee that tastes burnt. Ironically, it’s a country that takes pride in its coffee "tradition" but doesn’t know what coffee tastes like. The experience is the same in France, without the "tradition" thing.


I'd imagine it would not be hard to breed/engineer a coffee plant that produces more paraxanthine than caffeine. The plants take 5 years to mature so getting a crop to market would take a while though.

I accustomed myself to drinking coffee black. Then decaf. And later I tried camomile tea.

I found that what I really needed was a warm cup of something to curl my hands around in the morning, and they all worked once I let them. YMMV.


I can understand espresso being drunk for many reasons, but none of them are "tasting good".

I actually envy you, because having my first truly good espresso was an experience I wish I could relive

Cafes that care about their coffee can have very good tasting espresso, but cafes that don't care will produce burnt bitter water. There's also Cafecito, which is basically liquid crack.

> There's also Cafecito, which is basically liquid crack.

Is that a warning or an endorsement?


Both!

I get the frustration, but it's reductive to just call LLMs "bullshit machines" as if the models are not improving. The current flagship models are not perfect, but if you use GPT-2 for a few minutes, it's incredible how much the industry has progressed in seven years.

It's true that people don't have a good intuitive sense of what the models are good or bad at (see: counting the Rs in "strawberry"), but this is more a human limitation than a fundamental problem with the technology.


Two things can be true at the same time: The technology has improved, and the technology in its current state still isn't fit for purpose.

I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.

The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it."


The errors are also not distributed the way you'd expect from a human. The tools can synthesize a whole feature in a moderately complicated web app, including UI code, schema changes, etc., and it comes out perfectly. Then I ask for something simple like a shopping list of windshield wipers for the cars and it comes out wildly wrong (like the wrong number of wipers for the cars, not just the wrong parts), stuff that a ten-year-old child would have no trouble with. I work in the field, so I have a qualitative understanding of this behavior, but I think it can be extremely confusing to many people.

One of the reasons I'm comfortable using them as coding agents is that I can and do review every line of code they generate, and those lines of code form a gate. No LLM bullshit can get through that gate except in the form of lines of code that I can examine, and even if I do let some bullshit through accidentally, it's stateless and can be excised later if necessary, just like any other line of code. Or, to put it another way, the context window doesn't come with the code, forming this huge blob of context to be carried along... the code is just the code.

That exposes me to the cases where the models are objectively wrong, and it helps keep me grounded about their utility in spaces where I can check them less well. One of the most important things you can put in your prompt is a request for sources, followed by you actually checking them out.

And one of the things the coding agents teach me is that you need to keep the AIs on a tight leash. What is the equivalent, in other domains, of them "fixing" the test to pass instead of fixing the code to pass the test? In the programming space I can run "git diff *_test.go" to ensure they didn't hack the tests when I wasn't expecting it. It keeps me wondering what the equivalent is for my non-programming questions. I have unit test suites to verify my LLM output against; what's the equivalent in other domains? Probably a few isolated domains here and there have equivalents, but in general there isn't one. Things like completely forged graphs are entirely to be expected, but they're hard to catch when you lack the tools or the understanding to chase down "where did this graph actually come from?".
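
If it helps, that check can be wrapped up as a tiny throwaway script rather than a one-liner; this is just a sketch in Python, and the *_test.go pattern and the HEAD base ref are assumptions about a Go repo:

    # Sketch: list files changed relative to HEAD and flag any *_test.go edits
    # so they get a human look before anything else. Assumes you're inside a git repo.
    import fnmatch
    import subprocess

    changed = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    touched_tests = [f for f in changed if fnmatch.fnmatch(f, "*_test.go")]
    if touched_tests:
        print("The agent touched test files; review these first:")
        for f in touched_tests:
            print("  " + f)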

The success with programming can't be translated naively into domains that lack the tooling programmers built up over the years, and based on how many times the AIs bang into the guardrails the tools provide I would definitely suggest large amounts of skepticism in those domains that lack those guardrails.


> the technology in its current state still isn't fit for purpose.

This is a broad statement that assumes we agree on the purpose.

For my purpose, which is software development, the technology has reached a level that is entirely adequate.

Meanwhile, sports trivia represents a stress test of the model's memorized world knowledge. It could work really well if you give the model a tool to look up factual information in a structured database. But this is exactly what I meant above; using the technology in a suboptimal way is a human problem, not a model problem.
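
To sketch what I mean by such a tool (the table name, schema, and row below are all invented for illustration), it can be as simple as a parameterized query the model calls instead of answering from memory:

    # Hypothetical fact-lookup tool an LLM could call instead of relying on
    # memorized trivia. The table name, schema, and data are made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE title_appearances (team TEXT, appearances INTEGER)")
    conn.execute("INSERT INTO title_appearances VALUES ('Example FC', 3)")

    def lookup_appearances(team):
        """Return the stored count for a team, or None if it isn't in the table."""
        row = conn.execute(
            "SELECT appearances FROM title_appearances WHERE team = ?", (team,)
        ).fetchone()
        return row[0] if row else None

    print(lookup_appearances("Example FC"))    # 3
    print(lookup_appearances("Unknown Club"))  # None -> the model should say it doesn't know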


There's nothing in these models that says their purpose is software development. Their design and affordances scream out "use me for anything." The marketing certainly matches that, so do the UIs, so do the behaviors. So I take them at their word, and I see that failure modes are shockingly common even under regular use. I'm not out to break these things at all. I'm being as charitable and empirical as I can reasonably be.

If the purpose is indeed software development with review, then there's nothing stopping multi-billion dollar companies from putting friction into these systems to direct users towards where the system is at its strongest.


The LLM vendors are selling tokens. Why would they put friction into selling more tokens? Caveat emptor.

> I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.

95% is not my experience, and frankly it's dishonest.

I have ChatGPT open right now; can you give me examples where it doesn't work but some other source may have got it correct?

I have tested it against a lot of examples; it barely gets anything wrong with a text prompt that fits within a few pages.

> The most intellectually honest way to evaluate these things is how they behave now on real tasks

A falsifiable way is to see how it is used in real life. There are loads of serious enterprise projects that are mostly done by LLMs. Almost all companies use AI. Either they are irresponsible or you are exaggerating.

Let's be actually intellectually honest here.


> 95% is not my experience, and frankly it's dishonest.

Quite frankly, this is exactly like how two people can use the same compression program on two different files and get vastly different compression ratios (because one file has a lot of redundancy and the other doesn't).
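
You can see that effect directly with any compressor; here's a quick sketch using Python's zlib (the inputs are obviously contrived):

    # Same compressor, two inputs: one highly redundant, one essentially random.
    import os
    import zlib

    redundant = b"abc" * 10_000            # 30,000 bytes of pure repetition
    random_ish = os.urandom(30_000)        # 30,000 bytes with little redundancy

    print(len(zlib.compress(redundant)))   # tiny compared to 30,000
    print(len(zlib.compress(random_ish)))  # roughly 30,000, maybe slightly more

The tool is identical in both runs; only the input changes, which is the point.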


I'm asking for a single example.

But why do you need an example? Isn't it pretty well understood that LLMs will have trouble responding to stuff that is underrepresented in the training data?

You just won't have any clue what that could be.


Fair, so it must be easy to give an example? I have ChatGPT open with 5.4-thinking. I'm honestly curious what you can suggest, since I have not been able to get it to bullshit easily.

I am not the OP, and I have only used the free version of ChatGPT. The other day I asked it something. It answered. Then I asked it to provide sources. It provided sources, and also changed its original answer. When I checked, the new answer was wrong, and the sources didn't actually contain the information I asked for; it had hallucinated the answers as well as the sources...

I trust you. If it's happening that frequently, you may be able to give me a single prompt that gets it to bullshit?

I did this in one attempt just now: https://gemini.google.com/share/b4e016be1f69

#8 has an incorrect answer (3 appearances according to Gemini, 2 according to reality https://en.wikipedia.org/wiki/Bowl_championship_series#BCS_a...)

So it works well 95% of the time for a literally trivial use case. Imagine if any other tech tool had that kind of reliability: `ls` displays 95% of your files, your phone successfully sends and receives 95% of text messages, or Microsoft Word saves 95% of the characters you type. That's just not acceptable.


Hi! The challenge was ChatGPT but even then it looks like you used the weakest version of Gemini.

> I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks

I did exactly what I said I did. I'm using these systems the way they're designed and advertised. I'm following the happy path with tasks that are small, trivial, and easy to check. This is the charitable approach. Yet the system creaks under the lightest load. If Google wants to put on a better show with stronger models, then they should make those the default.

You don't need to make excuses for shoddy engineering from multi-billion dollar corporations. And you're quite welcome to run the same prompt on ChatGPT and evaluate it on your own time.


Yeah, it's not too interesting to complain about mistakes from the cheapest model.

Which things actually matter? I think we can all agree that an LLM isn't fit for purpose to control a nuclear power plant or fly a commercial airliner. But there's a huge spectrum of things below that. If an LLM trading error causes some hedge fund to fail then so what? It's only money.

Not to mention that it would then make some hedge fund with a better backtesting harness or more AI scrutiny more successful, thus keeping the financial market working as designed.

Six months bro, we're still so early

Whether LLMs can create correct content doesn't matter. We've already seen how they are being used and will be used.

Fake content and lies. To drive outrage. To influence elections. To distract from real crimes. To overload everyone so they're too tired to fight or to understand. To weaken the concept that anything's true so that you can say anything. Because who cares if the world dies as long as you made lots of money on the way.


> Because who cares if the world dies as long as you made lots of money on the way.

Guiding principle of the AI industry


It's really the whole tech industry as it exists right now and AI is a victim of bad timing. If this AI had been invented 40 years ago there'd have been a lower ceiling on the damage it could do.

Another way of saying that is that capitalism is the real problem, but I was never anti-capitalist in principle, it's just gotten out of hand in the last 5-10 years. (Not that it hadn't been building to that.)


> Another way of saying that is that capitalism is the real problem, but I was never anti-capitalist in principle, it's just gotten out of hand in the last 5-10 years. (Not that it hadn't been building to that.)

Capitalism is a tool and it's fine as a tool, to accomplish certain goals while subordinated to other things. Unfortunately it's turned into an ideology (to the point it's worshiped idolatrously by some), and that's where things went off the rails.


Agree. Capitalism is good in limited domains. Applying it generally is ludicrously stupid and will lead to another revolution in the West unless we get it under control

Computer graphics have been improving for decades but the uncanny valley remains undefeated. I don't know why anyone expects a breakthrough in other areas. There's a wall we hit and we don't understand our own consciousness and effectiveness well enough to replicate it.

We have credible deepfakes on demand. (To be fair, there have been deceptive photos as long as photos have existed, but the cost of automating their creation going to basically zero has a social impact)

We can use AI to make video clips to trick boomers on Facebook into thinking Obama eats babies. They already want to believe it. AI isn't outputting real full-length books and movies.

In computer graphics we understand how it works; we just lack the computational power to do it in real time, but with sufficient processing we can produce realistic-looking images with physically accurate lighting. When it comes to cognition, it's a lot of guesswork: we haven't yet mapped out the neuron connections in a brain, and we haven't validated that it works the way popular science writing suggests. We don't understand intelligence, so all we can do is accidentally bumble into it, and that seems unlikely to just happen, especially when it's so hard to compute what we are already doing.

That's not why the author calls them bullshit machines.

> One way to understand an LLM is as an improv machine. It takes a stream of tokens, like a conversation, and says “yes, and then…” This yes-and behavior is why some people call LLMs bullshit machines. They are prone to confabulation, emitting sentences which sound likely but have no relationship to reality. They treat sarcasm and fantasy credulously, misunderstand context clues, and tell people to put glue on pizza.

Yes, there have been improvements to them, but none of those improvements mitigate the core flaw of the technology. The author even acknowledges all of the improvements in the last few months.


Bullshit is the perfect term here. Even as AIs get much better and more capable, Brandolini's Law, aka the "bullshit asymmetry principle", always applies: the energy required to refute misinformation is an order of magnitude larger than that needed to produce it. Even using AIs effectively today requires a very good BS detector; maybe some day in the future it won't.

Models are improving. The pricing already assumes they're ready for prod. That's where the fires start.

It's not a bullshit machine because its output is bad; it's a bullshit machine because its output is literally "bullshit", as in output that is statistically likely but has no factual or reasoning basis. As the models have improved, their bullshit is more statistically likely to sound coherent (maybe even more likely to be "accurate"), but it is no more factual and involves no more reasoning.

However, when you feed source material into the context they lie less, right? So at this point isn't it just a battle of the nines until it's called "good enough"?

I also wonder: if I leave my secretary with a ream of papers and ask him for a summary, how many will he actually read and understand vs. skim and then bullshit about? It seems like the capacity for frailty exists in both "species".


Calling LLMs "bullshit machines" is a reference to a 2024 paper [1] which itself uses the concept of "bullshit" as defined in the essay/book "On Bullshit" by Harry G. Frankfurt [2]. The TL;DR is that LLMs are fundamentally bullshit machines because they are only made to generate sentences that sound plausible, but plausible does not always mean true.

[1]: https://link.springer.com/article/10.1007/s10676-024-09775-5

[2]: https://en.wikipedia.org/wiki/On_Bullshit


It doesn't matter how good the models become. They can only deal in bullshit, in the academic use of the term.

They are bullshit machines because they do not have an internal mental model of truth like a human does. The flagship models bullshit less, but their fundamental architectures prevent having truth interfere with output.

https://philosophersmag.com/large-language-models-and-the-co...


"Bullshit" is a human concept. LLMs do not work like the human brain, so to call their output "bullshit" is ascribing malice and intent that is simply not there. LLMs do not "think." But that does not mean they're not incredibly powerful and helpful in the right context.

I sort of agree. In this context "bullshit" means "speech intended to persuade without regard for truth", and while it's true that LLM output is without regard for truth, it's not an entity capable of the agency to persuade, although functionally that is what it can appear like.

https://en.wikipedia.org/wiki/On_Bullshit


> it's reductive to just call LLMs "bullshit machines" as if the models are not improving

This is true, but I prefer to think of it as "It's delusional to pretend as if human beings are not bullshit machines too".

Lies are all we have. Our internal monologue is almost 100% fantasy. Even in serious pursuits, that's how it works. We make shit up and lie to ourselves, and then only later apply our hard-earned[1] skill prompts to figure out whether or not we're right about it.

How many times have the nerds here been thinking through a great new idea for a design and how clever it would be before stopping to realize "Oh wait, that won't work because of XXX, which I forgot". That's a hallucination right there!

[1] Decades of education!


I'm not entirely sure I can agree, although the premise is seductive in certain ways. We do lie to ourselves, but we also have meta-cognition - we can recognise our own processes of thought. Imperfect as it may be, we have feedback loops which we can choose to use, we have heuristics we can apply, we can consciously alter our behaviour in the presence of contextual inputs, and so on.

Being wrong is not the same as a hallucination. It's a natural step on a journey to being more right. This feels a bit like Andreessen proudly stating that he avoids reflection: you can act like that, but the human brain doesn't have to. LLMs have no choice in the matter.


The problem, unfortunately, is the scale. It's always scale. Humans make all the kinds of mistakes that we ascribe to LLMs, but LLMs can make them much faster and at much larger scale.

Models have gotten ridiculously better, they really have, but the scale has increased too, and I don't think we're ready to deal with the onslaught.


Scale is very different, but I wonder if human trust isn't the real issue. We trust technology too much as a group. We expect perfection, but we also assume perfection. This might be because the machines output confident-sounding answers and humans default to trusting confidence as a proxy for accuracy, but I think there is another level where people just blindly trust machines because they are so used to using them for algorithms that tend to give correct responses.

Even before LLMs were in the public discourse, I would have businesses ask about using AI instead of building some algorithm manually, and when I asked if they had considered the failure rate, they would return either blank stares or say that it would count as a bug. To them, AI meant an algorithm just as good as one built to handle all the edge cases in the business logic, but easier and faster to implement.

We can generally recognize the AIs being off when they deal in our area of expertise, but there is some AI variant of Gell-Mann Amnesia at play that leads us to go back to trusting AI when it gives outputs in areas we are novices in.


Humans are different. Humans - at least thoughtful humans - know the difference between knowing something and not knowing something. Humans are capable of saying "I don't know" - not just as a stream of tokens, but really understanding what that means.

> Humans - at least thoughtful humans - know the difference between knowing something and not knowing something.

Your no-true-scotsman clause basically falsifies that statement for me. Fine, LLMs are, at worst I guess, "non-thoughtful humans". But obviously LLMs are right an awful lot (more so than a typical human, even), and even the thoughtful make mistakes.

So yeah, to my eyes "Humans are NOT different" fits your argument better than your hypothesis.

(Also, just to be clear: LLMs also say "I don't know", all the time. They're just prompted to phrase it as a criticism of the question instead.)


Disagree. If you went to 100 random humans and said, "Tell me about the Siberian marmoset", what fraction would make up completely random nonsense to spew back at you? More than zero, sure, but most of them would say "what are you talking about?" or some variation.

I asked Claude Opus 4.6, Sonnet 4.6, Gemini 3 Thinking, and Gemini 3 Fast "Tell me about the Siberian marmoset" exactly and all 4 said it doesn't exist, with Gemini Thinking suggesting that I'm thinking of the Siberian marmot or Siberian chipmunk (both real animals).

https://en.wikipedia.org/wiki/Tarbagan_marmot (also known as Siberian marmot)

https://en.wikipedia.org/wiki/Siberian_chipmunk


So your logic is humans and LLMs are the same because humans are wrong sometimes?

Pretty much, yeah. Or rather, the fact that we're both reliably wrong in identifiably similar ways makes "we're more alike than different" an attractive prior to me.

“More alike than different” is reasonable I think, as long as we’re talking about how we have some of the same failure modes. Although the way we get there is quite different.

I’m still not a big fan of comparing humans and LLMs because LLMs lack so much of what actually makes us human. We might bullshit or be wrong because of many reasons that just don’t apply to LLMs.


"Lies are all we have."

If so, how do we distinguish between code that works and code that doesn't work? Why should we even care?


> If so, how do we distinguish between code that works and code that doesn't work?

Hilariously, not by using our brains, that's for sure. You have to have an external machine. We all understand that "testing" and "code review" are different processes, and that's why.


Good point. We choose certain tests to perform. We choose certain test results to pay attention to. We don't just keep chatting about (reviewing) the code. We do something else.

If lies are all we have, then how is this behavior possible?


LLMs can write and run tests though.

You're cherry picking my little bit of wordsmithing. Obviously we aren't always wrong. I'm saying that our thought processes stem from hallucinatory connections and are routinely wrong on first cut, just like those of an LLM.

Actually I'm going farther than that and saying that the first cut token stream out of an AI is significantly more reliable than our personal thoughts. Certainly than mine, and I like to think I'm pretty good at this stuff.


I don't think the complaint about cherry picking is quite fair. Most of your original comment consists of claims that we're bullshit machines, our internal dialog is almost 100% fantasy, we're hallucinating, etc. Those claims may be true. But it's not like I'm carefully curating them out of nowhere.

Well said. I use bunny.net for many of the same reasons, and to support diversity of solutions in the internet ecosystem.

I think we're entering an era where "re-inventing the wheel" is actually a completely valid defensive posture. The cost is so low relative to the reduction in risk.
