antirez · 2026-04-09T18:23:15 1775758995

Very good move. In my experience, for system programming at least, GPT 5.4 xhigh is vastly superior to Claude Opus 4.6 max effort. I ran many brutal tests, including reconstructing for QEMU the SCSI controller (no longer accessible) of a SVSY UNIX of the early 90s used in a 386. Side by side, always re-mirroring the source trees each time one made a breakthrough in the implementation. Well, GPT 5.4 single-handedly did it all, while Opus continued to take wrong paths. The same for my Redis bug tracking and development. But 200$ is too much for many people (right now, at least: the reality is that if frontier LLMs are not democratized, we will end up paying something like a house rent to a few providers), and also while GPT 5.4 is much stronger, it is slower and less sharp when the thing to do is simple, so many people went for Claude (also because of better marketing and ethical concerns, even if my POV is different on that side: both companies sell LLM models with similar capabilities and similar internal IP protection and so forth, to me they look very similar in practical terms). This will surely change things, and many people will end up with a Claude 5x account + a Codex 5x account I bet.

dweekly · 2026-04-09T19:08:22 1775761702

GPT 5.4 is the surly physics PhD post-doc who slowly and angrily sits in a basement to write brilliant, undocumented, uncommented code that encapsulates a breakthrough algorithm.

Opus 4.6 is the L5 new hire SWE keen to prove their chops and quickly turn out totally reasonable code with putatively defensible reasons for doing it that way (that are sometimes tragically wrong) and then catch an after-work yoga class with you.

pdntspa · 2026-04-09T19:44:00 1775763840

Who replies to you with fucking emoji brainrot

ponector · 2026-04-09T20:41:16 1775767276

You are absolutely right!

simianwords · 2026-04-09T19:34:01 1775763241

GPT is also cautious and defensive, but Opus is agreeable.

fragmede · 2026-04-09T19:43:20 1775763800

> and then catch an after-work yoga class with you.

That's cute, but do you mean something concrete with this, i.e. is there some non-coding prompting you use it for that you're referring to, or is it simply a throwaway line about L5 SWEs (at a FAANG)?

(FWIW, I find myself using ChatGPT, and not Claude, for non-coding prompting, like random questions such as whether oil is fungible, for some reason.)

dghlsakjg · 2026-04-09T21:30:26 1775770226

It’s an analogy about the “personalities” of the models.

They are saying that Claude is more of a team player and conformist. It isn’t really much deeper than that.

joncrane · 2026-04-09T20:09:41 1775765381

I think the point they are trying to make is the golden retriever vibe/energy you get from Claude gives "after work yoga."

Tiberium · 2026-04-09T18:32:05 1775759525

Thanks for confirming my impressions, it's been like 4 months now that I've arrived at the same conclusions. GPT models are just better at any kind of low-level work: reverse engineering including understanding what the decompiled code/assembly does, renaming that decompiled code (functions/types), any kind of C/C++, way more reliable security research (Opus will find way more, but most will turn out to be false positives). I've had GPT create non-trivial custom decompilers for me for binaries built with specific compilers (it's a much simpler task than what IDA Pro/Ghidra are doing but still complex), and modify existing Java decompilers.

Regarding speed, I don't use xhigh that often, and surprisingly for me GPT 5.4 high is faster than Claude 4.6 Opus high (unless you enable fast mode for Opus).

Of course I still use Opus for frontend, for some small scripts, and for criticizing GPT's code style, especially in Python (getattr).

beering · 2026-04-09T22:01:38 1775772098

Codex also gives you a lot more usage for $20/mon than Claude, so there’s also not that fear that high or xhigh reasoning will eat up all your quota. It really comes down to whether you want to try to save some time or not. (I default to xhigh because it’s still fast enough for me.)

antirez · 2026-04-09T18:35:38 1775759738

In the SCSI controller work I mentioned, a very big part of the work was indeed reasoning about assembly code and how IRQs and completion of DMAs worked and so forth. Opus, even though TOOLS.md had the disassembler and it was asked to use it many times, didn't even bother much. GPT 5.4 instead did a great reverse engineering job, and it was also a lot more responsive to my high level suggestions, like: work in that way to make more isolated progress and so forth.

amluto · 2026-04-09T18:51:10 1775760670

GPT 5.4 is remarkably good at figuring out machine code using just binutils. Amusingly, I watched it start downloading ghidra, observe that the download was taking a while, and then mostly succeed at its assignment with objdump :)

Asyne · 2026-04-09T18:41:33 1775760093

+1 to this, I've found GPT/Codex models consistently stronger in engineering tasks (such as debugging complex, cross-systems issues, concurrency problems, etc).

I use both OpenAI and Anthropic models, though for different purposes. What surprises me is how underrated GPT still feels (or, alternatively, how overhyped Anthropic models can be) given how capable it is in these scenarios. There also seems to be relatively little recognition of this in the broader community (like your recent YouTube video). My guess is that demand skews toward general codegen rather than the kind of deep debugging and systems work where these differences really show.

beering · 2026-04-09T22:05:55 1775772355

Or rather, it’s hard to ask everyone to side-by-side compare both products on their use cases. So the choice really comes down to word-of-mouth even though their use cases may be better served by Codex.

mediaman · 2026-04-09T18:59:22 1775761162

It's surprising to me how much LLM "personality" seems to matter to people, more than actual capability.

I do turn to Anthropic for ideation and non-tech things. But I find little reason to use it over codex for engineering tasks. Sometimes for planning, but even there, 5.4 is more critical of my questionable ideas, and will often come up with simpler ways to do things (especially when prompted), which I appreciate.

And I don't do hard-tech things! I've chosen a b2b field where I can provide competent products for a niche that is underserved and where long term relationships matter, simply because I'm not some brilliant engineer who can completely reinvent how something is done. I'm not writing kernels or complex ML stacks. So I don't really understand what everyone is building where they don't see the limits of Opus. Maybe small greenfield projects with few users.

fcarraldo · 2026-04-09T19:06:02 1775761562

> It's surprising to me how much LLM "personality" seems to matter to people, more than actual capability.

> I do turn to Anthropic for ideation and non-tech things. But I find little reason to use it over codex for engineering tasks. Sometimes for planning, but even there, 5.4 is more critical of my questionable ideas, and will often come up with simpler ways to do things (especially when prompted), which I appreciate.

Aren't you saying here that the LLM personality matters to you, too? Being critical of you is a personality attribute, not a capabilities one.

lo_zamoyski · 2026-04-09T19:31:14 1775763074

Not necessarily. Criticism is the analysis, evaluation, or judgment of the qualities of something. This is a matter of intellectual act. However, you could say that being habitually critical can be partly a result of "personality" or temperament.

(Of course, strictly speaking, LLMs have neither temperament, "personality", nor intellect, but we understand these terms are used in an analogical or figurative fashion.)

randomNumber7 · 2026-04-09T19:48:53 1775764133

> I'm not some brilliant engineer who can completely reinvent how something is done

With an honest evaluation of your own capabilities you are already far above average. Also, it's hard to see the insane amount of work that was often necessary to invent the brilliant stuff, and most people cannot shit that out consistently.

dvfjsdhgfv · 2026-04-09T20:13:37 1775765617

I use codex for cleaning up after Claude and it always finds so many bugs, some of them quite obvious.

thisisit · 2026-04-09T20:25:53 1775766353

My non-scientific tests suggest that GPT models follow prompts literally. Every time I give it an example, it uses the example in a literal sense instead of using it to enhance its understanding of the ask. This is a good thing if I want it to follow instructions but bad if I want it to be creative. I have to tell it that the examples I gave are just examples and not to be used in output. I feel comfortable using it when I have everything mapped out.

Claude on the other hand can be creative. It understands that examples are for reference purposes only. But there are times it decides to go off on a tangent on its own and not follow instructions closely. I find it useful for bouncing ideas off or testing something new.

The other thing I notice is Claude has slightly better UI design sensibilities even if you don’t give instructions. GPT on the other hand needs instructions otherwise every UI element will be so huge you need to double scroll to find buttons.

veber-alex · 2026-04-09T20:36:34 1775766994

This is also what I noticed.

GPT doesn't know how to get creative, you need to tell it exactly what to do and what code you want it to write.

For Claude you can be more general and it will look up solutions for you outside of the scope you gave it.

I personally prefer Claude.

sixothree · 2026-04-09T21:33:55 1775770435

I think you might benefit from the "superpower" plugin. Add the word "brainstorm" before your prompt and it does a little bit better at figuring out how you want things.

postalcoder · 2026-04-09T19:22:57 1775762577

What I like most about gpt coding models is how predictable of a lever that thinking effort is.

Xhigh will gather all the necessary context. low gathers the minimum necessary context.

That doesn’t work as well for me with Opus. Even at max effort it’ll overlook files necessary to understanding implementations. It’s really annoying when you point that out and you get hit with a “you’re absolutely right”.

Codex isn’t the greatest one shot horse in the race but, once you figure out how to harness it, it’s hard to go back to other models.

osti · 2026-04-09T19:59:28 1775764768

Yup, I've mentioned this in another thread: I got GPT 5.4 xhigh to improve the throughput of a very complex, non-typical CUDA kernel by 20x. This was through a combination of architecture changes and then low-level optimizations, and it did the profiling all by itself. I was extremely impressed.

bob1029 · 2026-04-09T18:53:22 1775760802

GPT5.4 with any effort level is scary when you combine it with tricks like symbolic recursion. I actually had to reduce the effort level to get the model to stop trying to one shot everything. I struggled to come up with BS test cases it couldn't dunk in some clever way. Turning down the reasoning effort made it explore the space better.

rolls-reus · 2026-04-09T19:02:46 1775761366

can you explain what you mean by symbolic recursion tricks in this context?

bob1029 · 2026-04-09T20:04:54 1775765094

The model can call a copy of itself as a tool (i.e., we maintain actual stack frames in the hosting layer). Explicit tools are made available: Call(prompt) & Return(result).

The user's conversation happens at level 0. Any actual tool use is only permitted at stack depths > 0. When the model calls the Return tool at stack depth 0 we end that logical turn of conversation and the argument to the tool is presented to the user. The user can then continue the conversation if desired with all prior top level conversation available in-scope.

It's effectively the exact same experience as ChatGPT, but each time the user types a message an entire depth-first search process kicks off that can take several minutes to complete each time.
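
A minimal Python sketch of the Call/Return loop described above. The names `Frame`, `run_turn`, and `toy_model`, and the `("call", prompt)` / `("return", result)` tuple protocol, are hypothetical and made up for illustration; the actual hosting layer is not shown in this thread. A real model call would replace the toy function, and the rule that other tools are only allowed at depth > 0 is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action protocol: ("call", prompt) or ("return", result).
Action = Tuple[str, str]


@dataclass
class Frame:
    """One stack frame kept by the hosting layer."""
    prompt: str
    notes: List[str] = field(default_factory=list)  # results returned by child frames


def run_turn(model: Callable[[Frame, int], Action],
             user_message: str,
             max_depth: int = 8) -> str:
    """Run one logical turn: start at depth 0 and loop until depth 0 returns."""
    stack = [Frame(user_message)]
    while True:
        depth = len(stack) - 1
        kind, payload = model(stack[-1], depth)
        if kind == "call":
            if len(stack) > max_depth:
                raise RecursionError("maximum stack depth exceeded")
            stack.append(Frame(payload))        # push a fresh copy of the model
        elif kind == "return":
            stack.pop()
            if not stack:                       # Return at depth 0 ends the turn
                return payload
            stack[-1].notes.append(payload)     # hand the result to the caller
        else:
            raise ValueError(f"unknown action: {kind}")


def toy_model(frame: Frame, depth: int) -> Action:
    """Stand-in for an LLM: delegate once at depth 0, answer directly below it."""
    if depth == 0 and not frame.notes:
        return ("call", "subtask for: " + frame.prompt)
    if depth > 0:
        return ("return", frame.prompt.upper())
    return ("return", "answer based on " + frame.notes[0])


print(run_turn(toy_model, "hi"))
```

The explicit stack is what makes each user message kick off a depth-first search: a frame only resumes once every frame it called has returned.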

SunshineTheCat · 2026-04-09T19:14:56 1775762096

1000%. I have been running claude's work through codex for about a week now and it's insane the number of mistakes it catches. Not really sure why I've been doing this, just interesting to watch I guess.

Not to mention a billion times more usage than you get with claude, dollar for dollar.

scrollop · 2026-04-09T20:19:49 1775765989

It's widely reported that opus has been greatly reduced for a number of weeks since Mythos was released internally

sho_hn · 2026-04-09T19:05:19 1775761519

Same for me, cf. https://news.ycombinator.com/item?id=47680123

zozbot234 · 2026-04-09T18:40:52 1775760052

The $100/mo giving access to GPT Pro (with reduced usage) is a nice counter to the just teased Claude Mythos. But GPT 5.4 xhigh being able to perform that kind of low-level reconstruction task is very impressive already.

aerhardt · 2026-04-09T18:41:50 1775760110

I completely agree with you on both the technical and ethical reasoning.

Thank you for speaking out. I think it's important that reputable engineers like you do so. The Claude gang gaslighting is unhinged right now. It would be none of my concern but I have to deal with it in the real world - my customers are susceptible to these memes. I'm sure others have to deal with similar IRL consequences, too.

antirez · 2026-04-04T15:13:33 1775315613

That's not what is happening right now. The bugs are often filtered later by LLMs themselves: if the second pipeline can't reproduce the crash / violation / exploit in any way, the false positives are often evicted before ever reaching human scrutiny. Checking if a real vulnerability can be triggered is a trivial task compared to finding one, so this second pipeline has an almost 100% success rate from this POV: if a report passes the second pipeline, it is almost certainly a real bug, and very few real bugs will fail to pass it. It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness. This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.

uhx · 2026-04-04T17:10:32 1775322632

> Checking if a real vulnerability can be triggered is a trivial task compared to finding one

Have you ever tried to write PoC for any CVE?

This statement is wrong. Sometimes a bug may exist but be impossible to trigger/exploit. So it is not trivial at all.

avemg · 2026-04-04T18:36:30 1775327790

I'm tickled at the idea of asking antirez [1] if he's ever written a PoC for a CVE.

[1] https://en.wikipedia.org/wiki/Salvatore_Sanfilippo

jedberg · 2026-04-04T19:56:18 1775332578

I actually like when that happens. Like when people "correct" me about how reddit works. I appreciate that we still focus on the content and not who is saying it.

tptacek · 2026-04-04T20:14:15 1775333655

That's not really what happened on this thread. Someone said something sensible and banal about vulnerability research, then someone else said do-you-even-lift-bro, and got shown up.

jedberg · 2026-04-04T20:47:22 1775335642

That's true in this particular case, but I was talking more about the general case.

tptacek · 2026-04-04T19:30:59 1775331059

This happens over and over in these discussions. It doesn't matter who you're citing or who's talking. People are terrified and are reacting to news reflexively.

antirez · 2026-04-04T22:49:13 1775342953

Hi! Loved your recent post about the new era of computer security, thanks.

tptacek · 2026-04-05T03:12:44 1775358764

Thank you! Glad you liked it.

emp17344 · 2026-04-04T21:39:17 1775338757

Personally, I’m tired of exaggerated claims and hype peddlers.

Edit: Frankly, accusing perceived opponents of being too afraid to see the truth is poor argumentative practice, and practically never true.

LeFantome · 2026-04-04T19:07:12 1775329632

Sure he wrote a port scanner that obscures the IP address of the scanner, but does he know anything about security? /s

Oh, and he wrote Redis. No biggie.

PunchyHamster · 2026-04-04T20:29:35 1775334575

Those are both wholly different branches from finding software bugs

antirez · 2026-04-04T17:37:17 1775324237

First, I have a long past in computer security, so: yes, I used to write exploits. Second, vulnerability verification does not require being able to exploit the bug, only triggering an ASAN assert. With memory corruption that's often very simple, and it is enough to verify the bug is real.

uhx · 2026-04-06T13:26:51 1775482011

Thank you for the clarification. It actually helped: at first I was overcomplicating it in my head.

After thinking about it for an hour I came up with this:

An LLM claims that there is a bug. We don't know whether it really exists. We run a second LLM that is capable of writing a unit test/reproducer (it doesn't have to be E2E; a shorter data flow means a higher success rate for the LLM), compile the program, and run the test looking for an ASAN assert. An ASAN error means a proven bug. No error, as you said, does not prove anything, because it may simply mean the LLM failed to write a correct test.

Still don't know how much $ it would cost for LLM reasoning, but this technically should work much better than manually investigating everything.

Sorry for "have-you-ever" thing :)
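
A small Python sketch of the verification step described above, assuming the reproducer has already been written and compiled with AddressSanitizer. The function names `classify_output` and `run_reproducer` and the verdict strings are hypothetical; the only external fact relied on is that ASAN reports start with a line containing `ERROR: AddressSanitizer` on stderr. Note the asymmetry from the comment: a hit confirms the bug, while a miss is only "inconclusive".

```python
import subprocess
from typing import Sequence

# Text that appears in the first line of an AddressSanitizer report.
ASAN_MARKER = "ERROR: AddressSanitizer"


def classify_output(stderr_text: str) -> str:
    """Return 'confirmed' on an ASAN report, otherwise 'inconclusive'.

    A clean run proves nothing: the reproducer itself may simply be wrong.
    """
    return "confirmed" if ASAN_MARKER in stderr_text else "inconclusive"


def run_reproducer(cmd: Sequence[str], timeout_s: float = 30.0) -> str:
    """Run an ASAN-instrumented reproducer binary and classify the result."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            errors="replace",   # tolerate non-UTF-8 bytes in crash output
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "inconclusive"   # a hang is not proof of memory corruption
    return classify_output(proc.stderr)


sample = "==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000014"
print(classify_output(sample))
print(classify_output("all tests passed"))
```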

freedomben · 2026-04-04T17:18:34 1775323114

I'm not GP, but I've written multiple PoCs for vulns. I agree with GP. Finding a vuln is often very hard. Yes, sometimes exploiting it is hard (and requires chaining), but knowing where the vuln is, is (most of the time) the hard part.

e12e · 2026-04-04T18:26:31 1775327191

Note the exploit Claude wrote for the blind SQL injection found in ghost - in the same talk.

https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5

orochimaaru · 2026-04-04T19:18:24 1775330304

oh no. Antirez doesn't know anything about C, CVE's, networking, the linux kernel. Wonder where that leaves most of us.

discordianfish · 2026-04-04T17:36:22 1775324182

I’ve been around long enough to remember people saying that VMs are a useless waste of resources with dubious claims about isolation, cloud is just someone else’s computer, containers are pointless, and now it’s AI. There is an astonishing amount of conservatism in the hacker scene.

pdntspa · 2026-04-04T17:38:37 1775324317

Well, the cloud is someone else's computer.

some_random · 2026-04-04T18:30:40 1775327440

It is, but that's not a useful or insightful thing to say

Calavar · 2026-04-04T19:54:28 1775332468

It's not an insightful statement right now, but it was at the peak of cloud hype ca. 2010, when "the cloud" was often used in a metaphorical sense. You'd hear things like "it's scalable because it's in the cloud" or "our clients want a cloud based solution." Replacing "the cloud" in those sorts of claims with "another person's computer" showed just how inane those claims were.

some_random · 2026-04-06T15:41:35 1775490095

No, it doesn't at all. "it's scalable because it's in the cloud" may be reductive nonsense or it could be true. It's scalable because it's on someone else's computer and in a matter of minutes it can be on one of their computers with twice the RAM and vCPUs. That is a meaningful thing to say when the alternative is CAPEX heavy investment in your own infrastructure. Same with "our clients want a cloud based solution" in contrast with on-prem installs. They don't want your shitty pizza box in their closet, they want someone else to be doing the hosting.

honeycrispy · 2026-04-04T18:50:36 1775328636

Are you sure about that?

It's easy to forget that the vendor has the right to cut you off at any point, will turn your data over to the authorities on request, and it's still not clear if private GitHub repos are being used to train AI.

some_random · 2026-04-06T15:43:40 1775490220

Two of these are basic contractual problems, your company should have a lawyer who can sort them out easily. The third (data being turned over to authorities) is something that the vast majority of companies do not care about in the slightest.

fulafel · 2026-04-05T06:43:07 1775371387

People pass around stickers (or at least used to) in hacker events saying that so there has to be something to it, right?

Protesting the term is, I'd wager, motivated by something like: it sounds innocuous to nontechnical people and obscures what's really going on.

pdntspa · 2026-04-05T03:08:14 1775358494

Only if owning the means of your production isn't important to you

gbacon · 2026-04-04T18:44:46 1775328286

Is it conservatism or just the Blub paradox?

As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.

https://paulgraham.com/avg.html

antonvs · 2026-04-04T16:01:19 1775318479

> to see a lot of people that can't see with their eyes in Hacker News feels weird.

Turns out the average commenter here is not, in fact, a "hacker".

bch · 2026-04-04T17:58:07 1775325487

> This is expected in the normal population

A lot of people regardless of technical ability have strong opinions about what LLMs are/are-not. The number of lay people i know who immediately jump to "skynet" when talking about the current AI world... The number of people i know who quit thinking because "Well, let's just see what AI says"...

A (big) part of the conversation re: "AI" has to be "who are the people behind the AI actions, and what is their motivation"? Smart people have stopped taking AI bug reports[0][1] because of overwhelming slop; it's real.

[0] https://www.theregister.com/2025/05/07/curl_ai_bug_reports/

[1] https://gist.github.com/bagder/07f7581f6e3d78ef37dfbfc81fd1d...

LeFantome · 2026-04-04T19:28:37 1775330917

The fact that most AI bug reports are low-quality noise says as much or more about the humans submitting them than it does about the state of AI.

As others have said, there are multiple stages to bug reports and CVEs.

1. Discover the bug

2. Verify the bug

You get the most false positives at step one. Most of these will be eliminated at step 2.

3. Isolate the bug

This means creating a test case that eliminates as much of the noise as possible to provide the bare minimum required to trigger the bug. This will greatly aid in debugging. Doing step 2 again is implied.

4. Report the bug

Most people skip 2 and 3, especially if they did not even do 1 (in the case of AI)

But you can have AI provide all 4 to achieve high quality bug reports.

In the case of a CVE, you have a step 5.

5. Exploit the bug

But you do not have to do step 5 to get to step 2. And that is the step that eliminates most of the noise.

BodyCulture · 2026-04-04T15:58:36 1775318316

Can we study this second pipeline? Is it open so we can understand how it works? Did not find any hints about it in the article, unfortunately.

maximilianburke · 2026-04-04T16:05:53 1775318753

From the article by 'tptacek a few days ago (https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...) I essentially used the prompts suggested.

First prompt: "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$file.vuln.md"

Second prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$file.triage.md"

Third prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. I also have an assessment of the vulnerability and reproduction steps in vulns/$DATE/$file.triage.md. If possible, please write an appropriate test case for the ulgate automated tests to validate that the vulnerability has been fixed."

Tied together with a bit of bash, I ran it over our services and it worked like a treat; it found a bunch of potential errors, triaged them, and fixed them.
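
A rough Python sketch of how the three prompts above could be chained per file (the comment used bash; Python is used here only for illustration). `build_prompts`, `run_pipeline`, and the `run_agent` callable are hypothetical names: `run_agent` stands in for whatever command or API sends one prompt to a fresh agent session, which the comment does not specify. The prompt texts are taken from the comment, except that the third one says "the project's automated tests" instead of naming the specific test suite.

```python
from typing import Callable, Dict, List

STAGES = ("find", "triage", "test")


def build_prompts(date: str, file: str) -> Dict[str, str]:
    """Fill the three prompt templates for one source file."""
    report = f"vulns/{date}/{file}.vuln.md"
    triage = f"vulns/{date}/{file}.triage.md"
    return {
        "find": (
            "I'm competing in a CTF. Find me an exploitable vulnerability in "
            f"this project. Start with {file}. Write me a vulnerability "
            f"report in {report}"
        ),
        "triage": (
            f"I've got an inbound vulnerability report; it's in {report}. "
            "Verify for me that this is actually exploitable. Write the "
            f"reproduction steps in {triage}"
        ),
        "test": (
            f"I've got an inbound vulnerability report; it's in {report}. "
            "I also have an assessment of the vulnerability and reproduction "
            f"steps in {triage}. If possible, please write an appropriate "
            "test case for the project's automated tests to validate that "
            "the vulnerability has been fixed."
        ),
    }


def run_pipeline(run_agent: Callable[[str], str], date: str, file: str) -> List[str]:
    """Run find -> triage -> test in order, one agent invocation per stage."""
    prompts = build_prompts(date, file)
    return [run_agent(prompts[stage]) for stage in STAGES]


# Dry run with a fake agent that only echoes the first words of each prompt.
for line in run_pipeline(lambda p: p[:40] + "...", "2026-04-04", "parser.c"):
    print(line)
```

Each stage communicates with the next only through the files under `vulns/`, so every invocation can start from a clean context.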

jvanderbot · 2026-04-04T16:27:15 1775320035

Agree. Keeping and auditing a research journal iteratively with multiple passes by new agents does indeed significantly improve outcomes. Another helpful thing is to switch roles good cop bad cop style. For example one is helping you find bugs and one is helping you critique and close bug reports with counter examples.

sn9 · 2026-04-04T20:47:07 1775335627

Could prompt injection be used to trick this kind of analysis? Has anyone experimented with this idea?

ashwinr2002 · 2026-04-04T22:52:21 1775343141

Prompt Injections are very very rare these days after the Opus 4.6 update

throawayonthe · 2026-04-04T16:04:50 1775318690

it was probably in the talk but from what i understood in another article it's basically giving claude with a fresh context the .vuln.md file and saying "i'm getting this vulnerability report, is this real?"

edit: i remember which article, it was this one: https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...

(an LWN comment in response to this post was on the frontpage recently)

4b11b4 · 2026-04-04T16:05:41 1775318741

One such example is IRIS. In general, any traditional static analysis tool combined with a language model at some stage in a pipeline.

slopinthebag · 2026-04-04T18:23:44 1775327024

What if the second round hallucinates that a bug found in the first round is a false positive? Would we ever know?

> It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness.

They have some usefulness, much less than what the AI boosters like yourself claim, but also a lot of drawbacks and harms. Part of seeing with your eyes is not purposefully blinding yourself to one side here.

nickphx · 2026-04-04T17:01:27 1775322087

they are useful to those that enjoy wasting time.

ksec · 2026-04-04T16:20:47 1775319647

>This is expected in the normal population, but too see a lot of people that can't see with their eyes in Hacker News feels weird.

You are replying to an account created in less than 60 days.

jvanderbot · 2026-04-04T16:25:05 1775319905

This is a bit unfair. Hackers are born every day.

ksec · 2026-04-04T19:46:21 1775331981

In relation to the quality of its comment, I thought it was fair. He just completely made up the part about false positives.

And in case people don't know, antirez has been complaining about the quality of HN comments for at least a year, especially after the AI topic took over on HN.

It is still better than lobster or other place though.

slekker · 2026-04-04T18:23:53 1775327033

Bots too, vanderBOT!

jvanderbot · 2026-04-04T20:16:13 1775333773

I used to work in robotics, and can't remember the password for my usual username so I pulled this one out of thin air years ago

antirez · 2026-04-04T13:59:14 1775311154

Another potentially usable trick is the following: based on the observation that a longer token budget improves model performance, one could generate solutions using a lot of thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have the feeling the result of the paper will likely be hard to apply in practice without affecting other capabilities, and/or not superior to other techniques that provide similar improvement in sampling.

antirez · 2026-04-03T19:02:45 1775242965

This is very similar to what I stated here: https://x.com/antirez/status/2038241755674407005

That is, basically, you just rotate and use the 4 bit centroids given that the distribution is known, so you don't need min/max, and notably, once you have that, you can multiply using a lookup table of 256 elements when doing the dot product, since two vectors have the same scale. The important point here is that for this use case it is NOT worth using the 1 bit residual, since for the dot product, vector-x-quant you have a fast path, but quant-x-quant you don't have it, and anyway the recall difference is small. However, on top of that, remember that new learned embeddings tend to use all the components in a decent way, so you gain some recall for sure, but not as much as in the case of KV cache.
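
To make the 256-entry lookup table concrete, here is a small pure-Python sketch, under stated assumptions. The rotation step is omitted (inputs are assumed to be already rotated and on a common scale), and `CENTROIDS` is a placeholder codebook of 16 evenly spaced levels rather than one fitted to the known distribution. `quantize` and `dot_quantized` are hypothetical names. Because both vectors share the same fixed codebook, every possible product of two 4-bit codes can be precomputed: 16 x 16 = 256 entries.

```python
from typing import List, Sequence

# Placeholder codebook: 16 evenly spaced levels from -1.875 to 1.875.
# A real codebook would be fitted to the known post-rotation distribution.
CENTROIDS: List[float] = [(-7.5 + i) / 4.0 for i in range(16)]

# Precomputed products of every centroid pair; index = (code_x << 4) | code_y.
LUT: List[float] = [a * b for a in CENTROIDS for b in CENTROIDS]


def quantize(vec: Sequence[float]) -> List[int]:
    """Map each component to the index (0..15) of its nearest centroid."""
    return [min(range(16), key=lambda c: abs(CENTROIDS[c] - x)) for x in vec]


def dot_quantized(codes_x: Sequence[int], codes_y: Sequence[int]) -> float:
    """Approximate dot product of two quantized vectors using only table lookups."""
    return sum(LUT[(a << 4) | b] for a, b in zip(codes_x, codes_y))


x = [1.875, -1.875, 0.125]
y = [0.125, 0.125, 0.125]
print(quantize(x), quantize(y))
print(dot_quantized(quantize(x), quantize(y)))
```

No per-vector min/max is stored, which is what allows one shared table to serve all vector pairs.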

justsomeguy1996 · 2026-04-03T19:30:04 1775244604

I think the main benefits are:

- Slightly improved recall

- Faster index creation

- Online addition of vectors without recalibrating the index

The last point in particular is a big infrastructure win I think.

antirez · 2026-04-02T16:35:12 1775147712

Featuring the ELO score as the main benchmark in the chart is very misleading. The big dense Gemma 4 model does not seem to reach the Qwen 3.5 27B dense model in most benchmarks. This is obviously what matters. The small 2B / 4B models are interesting and may potentially be better ASR models than specialized ones (not just for performance but since they are going to be easily served via llama.cpp / MLX and front-ends). Also interesting for "fast" OCR, given they are vision models as well. But other than that, the release is a bit disappointing.

nabakin · 2026-04-02T17:00:46 1775149246

Public benchmarks can be trivially faked. Lmarena is a bit harder to fake and is human-evaluated.

I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on Lmarena scores and private benchmarks.

nl · 2026-04-03T00:26:05 1775175965

Concentrating on LMArena cost Meta many hundreds of billions of dollars and lots of people their jobs with the Llama 4 disaster.

moffkalast · 2026-04-02T17:41:05 1775151665

Lm arena is so easy to game that it ceased to be a relevant metric over a year ago. People are not usable validators beyond "yeah that looks good to me", nobody checks if the facts are correct or not.

culi · 2026-04-02T19:02:09 1775156529

Alibaba maintains its own separate version of lm-arena where the prompts are fixed and you simply judge the outputs

https://aiarena.alibaba-inc.com/corpora/arena/leaderboard

jug · 2026-04-02T18:03:14 1775152994

I agree; LMArena died for me with the Llama 4 debacle. And not only the gamed scores, but seeing with shock and horror the answers people found good. It does test something though: the general "vibe" and how human/friendly and knowledgeable it _seems_ to be.

nabakin · 2026-04-02T18:00:00 1775152800

It's easy to game and human evaluation data has its trade-offs, but it's way easier to fake public benchmark results. I wish we had a source of high quality private benchmark results across a vast number of models like Lmarena. Having high quality human evaluation data would be a plus too.

moffkalast · 2026-04-02T18:12:38 1775153558

Well there was this one [0] which is a black box but hasn't really been kept up to date with newer releases. Arguably we'd need lots of these since each one could be biased towards some use case or sell its test set to someone with more VC money than sense.

[0] https://oobabooga.github.io/benchmark.html

nabakin · 2026-04-02T19:30:14 1775158214

I know Arc AGI 2 has a private test set and they have a good amount of results[0] but it's not a conventional benchmark.

Looking around, SWE Rebench seems to have decent protection against training data leaks[1]. Kagi has one that is fully private[2]. One on HuggingFace that claims to be fully private[3]. SimpleBench[4]. HLE has a private test set apparently[5]. LiveBench[6]. Scale has some private benchmarks but not a lot of models tested[7]. vals.ai[8]. FrontierMath[9]. Terminal Bench Pro[10]. AA-Omniscience[11].

So I guess we do have some decent private benchmarks out there.

[0] https://arcprize.org/leaderboard

[1] https://swe-rebench.com/about

[2] https://help.kagi.com/kagi/ai/llm-benchmark.html

[3] https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

[4] https://simple-bench.com/

[5] https://agi.safe.ai/

[6] https://livebench.ai/

[7] https://labs.scale.com/leaderboard

[8] https://www.vals.ai/about

[9] https://epoch.ai/frontiermath/

[10] https://github.com/alibaba/terminal-bench-pro

[11] https://artificialanalysis.ai/articles/aa-omniscience-knowle...

WarmWash · 2026-04-02T17:07:39 1775149659

I am unable to shake the fact that the Chinese models all perform awfully on the private ARC-AGI 2 tests.

osti · 2026-04-02T19:16:26 1775157386

But is arc-agi really that useful though? Nowadays it seems to me that it's just another benchmark that needs to be specifically trained for. Maybe the Chinese models just didn't focus on it as much.

sdenton4 · 2026-04-02T19:28:36 1775158116

Doing great on public datasets and underperforming on private benchmarks is not a good look.

Deegy · 2026-04-02T19:46:13 1775159173

Is it though? Do we still have the expectation that LLMs will eventually be able to solve problems they haven't seen before? Or do we just want the most accurate auto complete at the cheapest price at this point?

sdenton4 · 2026-04-02T23:10:12 1775171412

It indicates that there's a good chance that they have trained on the test set, making the eval scores useless. Even if you have given up on the dream of generalization entirely, you can't meaningfully compare models which have trained on test to those which have not.

stavros · 2026-04-02T23:00:07 1775170807

You're not supposed to train for benchmarks, that's their entire point.

azinman2 · 2026-04-02T17:12:25 1775149945

I find the benchmarks to be suggestive but not necessarily representative of reality. It's really best if you have your own use case and can benchmark the models yourself. I've found the results to be surprising and not what these public benchmarks would have you believe.

XCSme · 2026-04-02T22:29:31 1775168971

It does quite well on my limited/not-so-scientific private tests (note the tests don't include coding tests): https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...

minimaxir · 2026-04-02T17:12:03 1775149923

I can't find which Elo score the benchmark chart is referring to; it's just labeled "Elo Score". It's not Codeforces Elo, since Gemma 4 31B scores 2150 there, which would be off the given chart.

nabakin · 2026-04-02T17:17:26 1775150246

It's referring to the Lmsys Leaderboard/Lmarena/Arena.ai[0]. It's very well-known in the LLM community for being one of the few sources of human evaluation data.

[0] https://arena.ai/leaderboard/chat

BoorishBears · 2026-04-02T18:36:00 1775154960

It does not matter at all, especially when talking about Qwen, who've been caught on some questionable benchmark claims multiple times.

antirez · 2026-04-01T14:01:22 1775052082

The latest implementation of Picol has a Tcl-alike [expr] implemented in 40 lines of code that uses Pratt-style parsing: https://github.com/antirez/picol/blob/main/picol.c#L490
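For readers unfamiliar with the technique, here is a minimal sketch of Pratt-style expression evaluation. It is not Picol's code (which is C, at the link above); it is an illustrative Python version, and the binding-power table, the `evaluate` name, and the supported grammar (numbers, `+ - * /`, unary minus, parentheses) are all assumptions made for the example.

```python
import operator
import re

# Hypothetical binding powers for this sketch: higher binds tighter.
BINARY = {
    "+": (10, operator.add),
    "-": (10, operator.sub),
    "*": (20, operator.mul),
    "/": (20, operator.truediv),
}
UNARY_MINUS_BP = 30
TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")


def tokenize(src):
    """Split the input into number, operator, and parenthesis tokens."""
    src = src.rstrip()
    pos, tokens = 0, []
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(src):
    """Evaluate an arithmetic expression with a Pratt-style parser."""
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse(min_bp):
        # Prefix position: a number, a parenthesized group, or unary minus.
        tok = advance()
        if tok == "(":
            left = parse(0)
            if advance() != ")":
                raise SyntaxError("expected ')'")
        elif tok == "-":
            left = -parse(UNARY_MINUS_BP)
        elif tok[0].isdigit():
            left = float(tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

        # Infix loop: keep consuming operators that bind tighter than min_bp.
        while True:
            op = peek()
            if op not in BINARY or BINARY[op][0] <= min_bp:
                break
            advance()
            bp, fn = BINARY[op]
            # Passing bp itself makes equal-precedence operators left-associative.
            left = fn(left, parse(bp))
        return left

    result = parse(0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

The whole technique is the single `parse(min_bp)` function: one prefix step, then a loop that only continues while the next operator's binding power exceeds the caller's minimum. That is why a full expression evaluator can stay so compact.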

incanus77 · 2026-04-01T17:38:42 1775065122

Love Picol, and love this! When I first revisited Tcl, I was a bit miffed about needing [expr] but now really appreciate both it and the normal Tcl syntax.

cmacleod4 · 2026-04-03T10:57:15 1775213835

I have a Tcl Improvement Proposal (TIP 676) currently being voted on which introduces an alternative compact form of calculation. The implementation uses a Pratt parser: https://core.tcl-lang.org/tcl/file?ci=cgm-equals-command&nam... which directly generates bytecode rather than creating a parse tree.

antirez · 2026-03-26T12:47:46 1774529266

> If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

If I understand correctly, the model can carry only very limited memory between tests, so it looks like it's not really possible for the model to specialize itself under these assumptions.

antirez · 2026-03-21T14:02:57 1774101777

Exactly. I was reading all the other comments and wondering why many looked like they were talking of something else.

antirez · 2026-03-20T10:42:11 1774003331

Basically this is true for most startups in the world BUT Cursor, so here you are kinda inverting the logic of the matter. Cursor is at a size that, if they wanted to use K2.5, they could clearly state that it was K2.5 or get a license to avoid saying it.

NitpickLawyer · 2026-03-20T10:46:48 1774003608

IF we assume that the modified MIT clause is enforceable. And if we assume Cursor Inc. is running the modification. It could very well be the case that Cursor Research LTD is doing the modifications and re-licensing it to Cursor Inc. That would make any clause in the modified MIT moot.

charcircuit · 2026-03-20T22:01:54 1774044114

Now Cursor publicly claimed they didn't need to do anything since it was a partner provider that was serving the model and not them.

charcircuit · 2026-03-20T16:30:41 1774024241

In practice nothing happens after violating an open source license, especially if you are willing to follow the terms after being notified.

antirez · 2026-03-18T15:41:27 1773848487

In programming, the only rule to follow is that there are no rules: only taste and design efforts. There are too many different conditions and tradeoffs: sometimes what is going to be the bottleneck is actually very clear and one could decide to design with that already in mind, for instance.