Nah, in my experience, if there's even a slight error in the first sentence of the chain of thought, it tends to compound and get worse and worse. I've had prompts that would generate a reasonable response in llama but turn into utter garbage in Deepthink.
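That compounding intuition can be made concrete with a toy calculation. This is just an illustration, not a model of any real LLM: it assumes each reasoning step is independently correct with some probability p (both the independence assumption and the numbers are hypothetical), in which case the chance a whole chain stays error-free decays geometrically with its length.

```python
def chain_success_prob(p: float, n: int) -> float:
    """Probability that all n reasoning steps are correct,
    assuming each step is independently correct with probability p."""
    return p ** n

# Even a high per-step accuracy erodes quickly over a long chain:
for n in (1, 5, 20):
    print(n, round(chain_success_prob(0.95, n), 3))
# 1  -> 0.95
# 5  -> 0.774
# 20 -> 0.358
```

In reality steps aren't independent, and a wrong early step can actively steer later ones off course, so the real decay can be even steeper than this toy model suggests.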
But how is this any different from real humans? They are not always right either. Sure, humans can understand things better, but are we really going to act like LLMs can't get better in the next year? And what about the next 6 months? I bet there are unknown startups like Deepseek that can push the frontier further.
The ways in which humans err are very different. You have a sense of your own knowledge of a topic, and if you start to stray from what you know, you're aware of it. Sure, you can lie about it, but you have inherent confidence levels in what you're doing.
Sure, LLMs can improve, but they're ultimately still bound by the constraints of the kind of data they're trained on, and they don't actually build world models through a combination of high-bandwidth exploratory training (like humans do) and repeated causal inference.