
> Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.

This has been the bottleneck in every ML (not just text/LLM) project I’ve been part of.

Not finding the right AI engineers. Not getting the MLOps setup textbook-perfect with the latest trends.

It’s collecting enough high-quality data and getting it properly annotated and verified, then doing proper evals with humans in the loop to get it right.
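
To make "verified" concrete: the cheapest sanity check is double-labeling a sample and measuring agreement before anyone trains on it. A minimal sketch (the labels and numbers here are made up):

    # Double-label a sample and check inter-annotator agreement before
    # trusting the dataset. Labels here are illustrative.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["spam", "ham", "spam", "spam", "ham", "spam"]
    annotator_b = ["spam", "ham", "ham", "spam", "ham", "spam"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")
    # Rule of thumb: below ~0.6, fix the annotation guidelines before
    # training anything or wiring up LLM-as-judge evals.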

People who only know these projects through headlines and podcasts really don’t like to accept this. Everyone wants synthetic data with LLMs doing the annotations and evals, because they’ve been sold the idea that the AI will do everything for you if you just use it right. Layer on top of that the idea that LLMs can also write the code for you, and it’s a mess when you have to deal with people whose AI knowledge comes entirely from headlines, LinkedIn posts, and podcasts.



Amen, brother. I'm working on a computer vision project right now, and it's a wild success.

This isn't my first CV project, but it's the most successful one, and that's chiefly because my client pulled out their wallet and let an army of annotators create all the training data I asked for, and more.


This has been a huge problem in AI research since at least 1998 (and that was just when I was first exposed to it). With enough data, everything is so much easier, and much simpler machine learning methods suffice.

Supervised learning. Took a while to make that work well.

And then every few years someone comes up with a way to distill data out of unsupervised examples. GPT is the big example of that these days, but there were "ImageNet (unlabeled)" and LAION before it too. The thing is that there is just so much unsupervised data out there.

Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and, as this article points out, in any specific application they tend to get bested by very simple models like XGBoost).
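
Concretely, the baseline that keeps winning is tiny. A sketch, with a bundled sklearn dataset standing in for whatever tabular data the application actually has:

    # Gradient-boosted trees on tabular features: the "very simple model".
    # The dataset is just a stand-in; the point is how little code this takes.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))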

The next frontier is probably "world models": first you train unsupervised, not to produce your final model but to predict the world, and THEN you train the model inside this simulated, predicted world. That's the reason Yann LeCun really, really wants to go in this direction.
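
In code, the two phases look roughly like this. This is a heavily simplified, Dyna-style sketch on a toy 1-D world, not how LeCun's actual proposals work:

    # Phase 1: fit a dynamics model ("world model") on unlabeled experience.
    # Phase 2: evaluate a policy inside the learned model, not the real world.
    import numpy as np

    rng = np.random.default_rng(0)

    # Logged experience from a toy 1-D world: s' = s + 0.5*a + noise.
    states = rng.normal(size=1000)
    actions = rng.choice([-1.0, 1.0], size=1000)
    next_states = states + 0.5 * actions + rng.normal(scale=0.1, size=1000)

    # Phase 1: world model as a least-squares fit of s' from (s, a).
    X = np.stack([states, actions], axis=1)
    w, *_ = np.linalg.lstsq(X, next_states, rcond=None)

    # Phase 2: roll a policy out in the *imagined* world only.
    def rollout(policy, s=0.0, steps=20):
        total = 0.0
        for _ in range(steps):
            a = policy(s)
            s = w[0] * s + w[1] * a   # imagined transition, no real environment
            total += -abs(s)          # reward: stay near 0
        return total

    def greedy(s):
        # Push the state back toward 0.
        return 1.0 if s == 0 else -float(np.sign(s))

    print("imagined return:", rollout(greedy))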


> Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and, as this article points out, in any specific application they tend to get bested by very simple models like XGBoost).

You can't blame the users for that, though. For instance, OpenAI's ChatGPT uses 'Ask Anything' as its home-page prompt. Zero specialization, expert at anything. And people totally believe it.


I’ve got no problem w/ synthetic data, but it is still more work than most people want to do.


There was a post on here recently about how you should build your own agent, and I completely agree. I'd say most competent developers should be building even more complex projects than an agent. Once you do, you realize it's a constant uphill battle, and it quickly becomes apparent that the data you're working with is the primary issue.
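
For anyone on the fence: the skeleton really is small. A sketch where call_llm is a placeholder for whatever chat API you use, and the single tool is a toy:

    # Bare-bones agent loop. call_llm wraps your model API of choice;
    # the model is expected to reply with JSON naming a tool or an answer.
    import json

    def call_llm(messages: list[dict]) -> str:
        raise NotImplementedError("wrap your model API here")

    def run_tool(name: str, args: dict) -> str:
        if name == "add":
            return str(args["a"] + args["b"])
        return f"unknown tool: {name}"

    def agent(task: str, max_steps: int = 5) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            messages.append({"role": "assistant", "content": reply})
            try:
                call = json.loads(reply)  # {"tool": ..., "args": ...} or {"answer": ...}
            except json.JSONDecodeError:
                return reply  # model answered in plain text
            if "answer" in call:
                return call["answer"]
            result = run_tool(call["tool"], call.get("args", {}))
            messages.append({"role": "user", "content": f"tool result: {result}"})
        return "step limit reached"

The loop is the easy part. The real work, as this thread says, is everything around it: the data the tools touch, and evaluating whether the outputs are any good.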


I don't know if that's what the GP and the comments above are talking about. "Agents" are the kind of thing/word that helps paper over the fact that these things only work because of a huge amount of humans in the loop at the outset (that is, you know, labor). Agents help us believe that LLMs can do everything for us, even bootstrap themselves, but what the thread above is about is that, really, what you get out correlates only with what you put in in the first place.


> Agents help us believe that LLMs can do everything for us, even bootstrap themselves

Having the agent, and treating it carelessly, helps one believe this.

Making it is another story.



