Not really; they're only as good as their context, and they do miss and forget important things. How often doesn't matter, because they do, and they will tell you with 100% confidence and with every synonym of "sure" that they caught it all. That's the issue.
I am very confident that these tools are better than the median programmer at code review now. They are certainly much more diligent. An actually useful standard to compare them to is human review, and for technical problems, they definitely pass it. That said, they’re still not great at giving design feedback.
But GPT-5 Pro, and to a certain extent GPT-5 Codex, can spot complex bugs like race conditions, or subtly incorrect logic like memory misuse in C, remarkably well. It is a shame GPT-5 Pro is locked behind a $200/month subscription, which means most people do not understand just how good the frontier models are at this type of task now.
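(To make "subtle memory misuse" concrete, here's a hypothetical C snippet of my own, not from anyone's real codebase: a stale alias across a realloc. It compiles cleanly and usually "works", which is exactly the kind of thing these models are now good at flagging in review.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *buf = malloc(16);
        if (!buf) return 1;
        strcpy(buf, "id=1234;bob");
        char *name = buf + 8;      /* alias into the middle of the buffer */
        buf = realloc(buf, 4096);  /* the buffer may move to a new address... */
        printf("%s\n", name);      /* ...so reading through `name` is a use-after-free */
        free(buf);
        return 0;
    }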
It's not scaffolding if the intelligence itself is adding it. Humans can make their own diagrams and maps to help themselves; LLM agents need humans to scaffold for them. That's the setup for the bitter lesson.
I've deployed lots of high-performance, clean, well-documented code generated by Claude or o3. I reviewed it against the requirements, added tests, and so on. Even with that overhead, it let me work 3x faster.
But it took conscious effort on my part to point out issues and inefficiencies on the LLM's part.
It's the collaborative kind of work where LLMs shine (even in so-called agentic flows).
They have different uses. The reasoning models aren't good at multi-turn conversations.
"GPT-4.5" is the best at conversations IMO, but it's slow. It's a lot lazier than o4 though; it likes giving brief overview answers when you want specifics.
The press and decision makers without technical knowledge are the target group; it doesn't matter whether it's used in production or not. They need a locally deployable model for enterprises too risk-averse to put their data into the cloud, and those enterprises don't care that their shitty homegrown ChatGPT replacement barely works. It's a checkbox.
Yeah, it seems like o1-pro was able to solve a few of those variations in the paper we referenced. Take a look at the rest of the examples and let me know! The paper's (somewhat) old at this point, but if you try our more complex square variations with overlapping edges and varying line thicknesses, the same issues arise. Although I generally agree a mind-boggling amount of compute will increase accuracy, for sure.