Similar experience. I use these AI tools on a daily basis and I have tons of examples like yours. In one recent instance I explicitly told it in the prompt not to use memcpy; it used memcpy anyway and generated a 30-line diff after thinking for 20 minutes. In that time I wrote a 10-line diff that didn't use memcpy.
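For what it's worth, the kind of change in question looks roughly like this (a purely hypothetical sketch, not the actual diff, which isn't shown here): replacing a memcpy call with an explicit byte loop.

    /* Hypothetical sketch of a "no memcpy" copy: an explicit
     * byte-by-byte loop instead of the libc call.
     * Not the actual diff from the comment above. */
    #include <stddef.h>

    static void copy_bytes(unsigned char *dst, const unsigned char *src,
                           size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];    /* assumes dst and src don't overlap */
    }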
I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.
Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.
Most devs aren't very good. That's the reality; we've all known it for a long time. AI is trained on their code, and so these "subpar" devs are blown away when they see the AI generate boring, subpar code.
The second you throw a novel constraint into the mix, things fall apart. But most devs don't even know about novel constraints, let alone work with them. So they don't see these limitations.
Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.
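To make that concrete, here's my own illustrative sketch (not model output) of the kind of constraint I mean: a reentrant, allocation-free integer formatter. No heap, no locks, no static state, so it's safe to call from a signal handler or from multiple threads at once.

    /* Illustrative example of "no allocation, no locks, reentrant":
     * format an unsigned integer into a caller-owned buffer. */
    #include <stddef.h>

    /* Writes the decimal form of v into buf; returns bytes written
     * (excluding the NUL), or 0 if buf is too small. */
    static size_t u64_to_dec(unsigned long long v, char *buf, size_t cap)
    {
        char tmp[21];               /* 2^64-1 has 20 digits, plus slack */
        size_t n = 0;

        do {                        /* produce digits in reverse order */
            tmp[n++] = '0' + (char)(v % 10);
            v /= 10;
        } while (v != 0);

        if (n + 1 > cap)
            return 0;               /* caller's buffer is too small */

        for (size_t i = 0; i < n; i++)
            buf[i] = tmp[n - 1 - i];    /* un-reverse into buf */
        buf[n] = '\0';
        return n;
    }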
I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I tested Opus 4.6 on this: I checked out an old version of a codebase that was built in a way that is totally inappropriate for security, and asked it to evaluate the system. It did far better than older models, but it still completely failed at the task, radically underestimating the severity of its findings and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.
The people glazing these tools can't design systems. I have a founder friend I've known for decades; he knows how to code but isn't really interested in it, and mostly sees programming as a way to make money. Before ChatGPT he would raise money and hire engineers ASAP, and when he wasn't a founder he would try to get into management roles, etc. About a year ago he told me he doesn't really write code anymore, and he showed me part of the codebase for the new company he's building. To my horror, it included a 500-line bash script that he admitted he didn't understand; he just edits it with prompts.
It didn't need to be a bash script; it could have been written in any scripting language. I presume it started as bash back when he was exploring the idea, and since it was already bash he just kept going with it. But that's the thing: these autocomplete services will never stop and tell you "maybe this 500-line script should be rewritten in Python." They just keep affirming you and piling onto the tech debt.
I used to freak out and think my days were numbered when people claimed they stopped writing code. But now I realize that they don't like writing code, don't care about getting better at it, don't know what good code looks like, and would hire an engineer if they could. With that framing, whenever I see someone say "Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter." I know for a fact that "everything" in that person's mind is very limited in scope.
Also, I just realized that there was an em-dash in that comment. So there's that. Wasn't even written by a person.
> I know for a fact that "everything" in that person's mind is very limited in scope.
I agree, and I think it's quite telling what people are impressed by. Someone elsewhere said that Opus 4.6 is a better programmer than they are and... I mean, I kinda believe it, but I think it says way more about them than it does about Opus.
> people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away
Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. There's also the sunk cost fallacy, post-purchase rationalization, choice-supportive bias; hell, look at r/MyBoyfriendIsAI... some people get very attached to these bots. They're like their work buddies or pets, so you don't even need to pay them; they'll glaze the crap out of it themselves.