They haven't improved at all since then in my experience. I poke at them every now and then and they still can't refrain from feeding me false info (and likely never will be able to, because they are stochastic parrots without any actual understanding). They are useless to me because I will take more time checking their output than I will just doing the task myself.
> and they still can't refrain from feeding me false info
If that's your metric, and even then only if you've got a boolean yes/no measurement, then I agree.
If you measure "false info" as a percent, they're better. If you measure scores on IQ tests, on general knowledge, on exams, on the size of a code problem they can write before they have a 20% chance of failure, on the quality of translations they make, on the new modalities like being able to both consume and respond with images, on mathematical olympiad questions, then they're significantly better.
Unfortunately, we can tell by the general public reaction (not just you) that even all those things combined still don't fully capture what normal people mean by "intelligence".
> They are useless to me because I will take more time checking their output than I will just doing the task myself.
What size problem do you give them? I use them in software, and try to keep each task I give them to one that would take a human 90 minutes. I can check the quality of an attempt at a human-would-take-90-minutes-to-do task in about 5 minutes.
When I've accidentally let an LLM do bigger tasks than that, then the difficulty of checking goes way up and the quality of the output goes way down.
Conveniently, one of the tasks that generally takes a human less than 90 minutes is breaking down a bigger task into sub-tasks that themselves take less than 90 minutes. Fail to do this and I get exactly what you experience.
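To make the 90-minute rule concrete, here's a minimal Python sketch. The `Task` class, the `leaf_tasks` function, and the example estimates are all made up for illustration; this is not a real tool, just one way to express the idea of only handing over pieces that fit the budget.

```python
from dataclasses import dataclass, field

MAX_MINUTES = 90  # per-task budget, in estimated human minutes


@dataclass
class Task:
    name: str
    human_minutes: int  # rough guess at how long a human would take
    subtasks: list["Task"] = field(default_factory=list)


def leaf_tasks(task: Task, limit: int = MAX_MINUTES) -> list[Task]:
    """Return the pieces small enough to hand over one at a time."""
    if task.human_minutes <= limit:
        return [task]
    if not task.subtasks:
        # Too big and not broken down yet: do the breakdown first.
        raise ValueError(f"{task.name!r} needs breaking down first")
    leaves: list[Task] = []
    for sub in task.subtasks:
        leaves.extend(leaf_tasks(sub, limit))
    return leaves


# Hypothetical example: a 4-hour feature split into three pieces.
feature = Task("add CSV export", 240, [
    Task("design the export format", 60),
    Task("write the serializer", 90),
    Task("wire up the endpoint and tests", 90),
])
print([t.name for t in leaf_tasks(feature)])
```

Anything over the limit that hasn't been split yet raises an error instead of being handed over as-is, which is the point: the breakdown step comes first.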
No, they absolutely know. They've been very very slowly migrating stuff over to the new Settings panel bit by bit. If you look at what's in Control Panel now, it's maybe half as much as what used to be in there ten years ago.
That said, it's insanely ridiculous that it's taken 10 years to get it even halfway done.
Why does that matter though? MCP is meant for LLMs, not humans, and for something like this lib it seems the human side of the API is based on JavaScript, not JSON.