practice9's comments | Hacker News

LLMs are getting quite good at reviewing the results and implementations, though


Not really. They're only as good as their context, and they do miss and forget important things. It doesn't matter how often; the point is that they do, and they will tell you with 100% confidence, using every synonym of "sure", that they caught it all. That's the issue.


I am very confident that these tools are better than the median programmer at code review now. They are certainly much more diligent. An actually useful standard to compare them to is human review, and for technical problems, they definitely pass it. That said, they’re still not great at giving design feedback.

But GPT-5 Pro, and to a certain extent GPT-5 Codex, can spot complex bugs like race conditions, or subtly incorrect logic like memory misuse in C, remarkably well. It is a shame GPT-5 Pro is locked behind a $200/month subscription, which means most people do not understand just how good the frontier models are at this type of task now.
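To make "memory misuse in C" concrete: a hypothetical snippet (not from any real codebase) with the kind of off-by-one a reviewer, human or model, has to catch:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append src to a heap buffer that holds a
 * C string of length *len (terminator not counted). */
char *append(char *buf, size_t *len, const char *src) {
    size_t add = strlen(src);
    /* BUG: reserves no room for the trailing '\0'... */
    char *tmp = realloc(buf, *len + add);
    if (tmp == NULL) {
        free(buf);
        return NULL;
    }
    /* ...yet copies add + 1 bytes (src plus its terminator),
     * writing one byte past the end of the block. */
    memcpy(tmp + *len, src, add + 1);
    *len += add;
    return tmp;
}
```

The one-byte overflow only corrupts an adjacent heap block sometimes, which is exactly why this class of bug slips past a median human review.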


I find it hilarious/sad that the Ergo M575, at half the price, has a much better design in that regard (just plastic that doesn’t degrade)


They should have used Claude Code for reviews


Humans cannot reason about code at scale. Unless you add scaffolding like diagrams and maps and …

Things that most teams don’t do or half-ass
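As a sketch of the cheap kind of "map" meant here (everything below is hypothetical, including the `incmap` name), a small C program that turns the #include lines of the files on its command line into a Graphviz graph:

```c
#include <stdio.h>

int main(int argc, char **argv) {
    puts("digraph includes {");
    for (int i = 1; i < argc; i++) {
        FILE *f = fopen(argv[i], "r");
        if (!f)
            continue;
        char line[512];
        char dep[256];
        while (fgets(line, sizeof line, f)) {
            /* Match local includes only: #include "foo.h" */
            if (sscanf(line, " #include \"%255[^\"]\"", dep) == 1)
                printf("  \"%s\" -> \"%s\";\n", argv[i], dep);
        }
        fclose(f);
    }
    puts("}");
    return 0;
}
```

Run it as `./incmap src/*.c | dot -Tpng -o deps.png` and you get a module map for roughly an afternoon of effort, which is exactly the step most teams skip or half-ass.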


It's not scaffolding if the intelligence itself is adding it. Humans can make their own diagrams and maps to help them; LLM agents need humans to scaffold for them. That's the setup for the bitter lesson.


A variation of “no taxation without representation”?


The human is a bad co-author here really.

I deployed lots of high-performance, clean, well-documented code generated by Claude or o3. I reviewed it against the requirements, added tests, and so on. Even with that overhead it allowed me to work 3x faster.

But it required conscious effort on my part to point out issues and inefficiencies on the LLM's part.

It is this collaborative type of work where LLMs shine (even in so-called agentic flows)
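As a sketch of what "added tests" can look like in practice, assuming a plain assert-based harness and a hypothetical `trim_len` standing in for the generated code under review:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-in for the LLM-generated function under
 * review: length of s with surrounding whitespace ignored. */
static size_t trim_len(const char *s) {
    size_t start = 0, end = strlen(s);
    while (s[start] && isspace((unsigned char)s[start])) start++;
    while (end > start && isspace((unsigned char)s[end - 1])) end--;
    return end - start;
}

int main(void) {
    /* Pin the requirements down as executable checks, including
     * the edge cases models tend to gloss over. */
    assert(trim_len("") == 0);
    assert(trim_len("   ") == 0);
    assert(trim_len("  hi  ") == 2);
    assert(trim_len("hi") == 2);
    return 0;
}
```

The point is less the framework than turning the requirements into checks before trusting the generated implementation.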


Well, the system prompt is still the same for both models, right?

Kinda points to people at OpenAI using o1/o3/o4 almost exclusively.

That's why nobody noticed how cringe 4o has become


They have different uses. The reasoning models aren't good at multi-turn conversations.

"GPT-4.5" is the best at conversations IMO, but it's slow. It's a lot lazier than o4 though; it likes giving brief overview answers when you want specifics.


People at OAI definitely use AVM, which is 4o-based, at least


Kinda similar to the “disappearances” in China or Russia


But who is the target group?

Last time, only a few groups of enthusiasts were willing to work through bugs to even run the buggy release of Gemma

Surely nobody runs this in production


The press and decision makers without technical knowledge are the target group; it doesn’t matter if it’s used in production or not. They need a locally deployable model to satisfy enterprises too risk-averse to put their data into the cloud, who also don’t care that their shitty homegrown ChatGPT replacement barely works. It’s a checkbox.


I tried the square example from the paper mentioned with o1-pro and it had no problem counting 4 nested squares…

And the 5 square variation as well.

So perhaps it is just a question of how much compute you are willing to throw at it


Yea, seems like o1-pro was able to solve a few of those variations in the paper we referenced. Take a look at the rest of the examples and let me know! The paper’s (somewhat) old at this point, but if you try our more complex square variations with overlapping edges and varying line thicknesses, the same issues arise. Although I generally agree a mind-boggling amount of compute will increase accuracy, for sure.

