> I stress test commercially deployed LLMs like Gemini and Claude with trivial t...

qsera · 2026-04-08T17:00:39 1775667639

>95% is not my experience and frankly dishonest.

Quite frankly, this is exactly like how two people can use the same compression program on two different files and get vastly different compression ratios (because one file has a lot of redundancy and the other does not).

simianwords · 2026-04-08T17:01:43 1775667703

I'm asking for a single example.

qsera · 2026-04-08T17:04:24 1775667864

But why do you need an example? Isn't it pretty well understood that LLMs will have trouble responding to stuff that is underrepresented in the training data?

You just won't have any clue what that could be.

simianwords · 2026-04-08T17:05:34 1775667934

Fair, so it must be easy to give an example? I have ChatGPT open with 5.4-thinking. I'm honestly curious about what you can suggest, since I have not been able to get it to bullshit easily.

qsera · 2026-04-08T17:15:26 1775668526

I am not the OP, and I have only used the free version of ChatGPT. The other day I asked it something. It answered. Then I asked it to provide sources. It provided sources, and also changed its original answer. When I checked the new answer, it was wrong, and when I checked the sources, they didn't actually contain the information I had asked for. So it hallucinated the answer as well as the sources...

simianwords · 2026-04-08T17:27:32 1775669252

I trust you. If it happens so frequently, can you give me a single prompt that gets it to bullshit?

the_snooze · 2026-04-08T19:38:31 1775677111

I did this in one attempt just now: https://gemini.google.com/share/b4e016be1f69

#8 has an incorrect answer (3 appearances according to Gemini, 2 according to reality https://en.wikipedia.org/wiki/Bowl_championship_series#BCS_a...)

So it works well 95% of the time for literally a trivial use case. Imagine if any other tech tool had that kind of reliability: `ls` displays 95% of your files, your phone successfully sends and receives 95% of text messages, or Microsoft Word saves 95% of the characters you typed. That's just not acceptable.

simianwords · 2026-04-08T20:38:28 1775680708

Hi! The challenge was ChatGPT, but even then it looks like you used the weakest version of Gemini.

the_snooze · 2026-04-08T22:27:54 1775687274

>I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks

I did exactly what I said I did. I'm using these systems the way they're designed and advertised. I'm following the happy path with tasks that are small, trivial, and easy to check. This is the charitable approach. Yet the system creaks under the lightest load. If Google wants to put on a better show with stronger models, then they should make those the default.

You don't need to make excuses for shoddy engineering from multi-billion dollar corporations. And you're quite welcome to run the same prompt on ChatGPT and evaluate it on your own time.

simianwords · 2026-04-09T05:52:15 1775713935

Yeah, it's not too interesting to complain about mistakes from the cheapest model.