
> Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.

This has been the bottleneck in every ML (not just text/LLM) project I’ve been part of.

Not finding the right AI engineers. Not getting the MLOps setup textbook-perfect with the latest trends.

It’s collecting enough high-quality data and getting it properly annotated and verified, then doing proper evals with humans in the loop to get it right.
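
To make "verified" concrete: the cheapest sanity check is double-labeling a sample and measuring agreement before anyone trains on it. A minimal sketch (the labels and numbers here are made up):

    # Double-label a sample and check inter-annotator agreement before
    # trusting the dataset. Labels here are illustrative.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["spam", "ham", "spam", "spam", "ham", "spam"]
    annotator_b = ["spam", "ham", "ham", "spam", "ham", "spam"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")
    # Rule of thumb: below ~0.6, fix the annotation guidelines before
    # training anything or wiring up LLM-as-judge evals.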

People who only know these projects through headlines and podcasts really don’t like to accept this. Everyone wants synthetic data with LLMs doing the annotations and evals, because they’ve been sold the idea that the AI will do everything for you if you just use it right. Layer on top of that the idea that LLMs can also write the code for you, and it’s a mess when you have to deal with people whose AI knowledge comes entirely from headlines, LinkedIn posts, and podcasts.



Amen, brother. I'm working on a computer vision project right now, and it's a wild success.

This isn't my first CV project, but it's the most successful one, and that's chiefly because my client pulled out their wallet and let an army of annotators create all the training data I asked for, and more.


This has been a huge problem in AI research since at least 1998 (and that was just when I was first exposed to it). With enough data, everything is so much easier, and much simpler machine learning methods suffice.

Supervised learning. Took a while to make that work well.

And then every few years someone comes up with a way to distill data out of unsupervised examples. GPT is the big example of that these days, but there were "ImageNet (unlabeled)" and LAION before it too. The thing is that there is just so much unsupervised data out there.

Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and, as this article points out, in any specific application they tend to get bested by very simple models like XGBoost).
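
Concretely, the baseline that keeps winning is tiny. A sketch, with a bundled sklearn dataset standing in for whatever tabular data the application actually has:

    # Gradient-boosted trees on tabular features: the "very simple model".
    # The dataset is just a stand-in; the point is how little code this takes.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))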

The next frontier is probably "world models": first you train unsupervised, not to produce your final model but to predict the world, and THEN you train the model inside this simulated, predicted world. That's the reason Yann LeCun really, really wants to go in this direction.
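
In code, the two phases look roughly like this. This is a heavily simplified, Dyna-style sketch on a toy 1-D world, not how LeCun's actual proposals work:

    # Phase 1: fit a dynamics model ("world model") on unlabeled experience.
    # Phase 2: evaluate a policy inside the learned model, not the real world.
    import numpy as np

    rng = np.random.default_rng(0)

    # Logged experience from a toy 1-D world: s' = s + 0.5*a + noise.
    states = rng.normal(size=1000)
    actions = rng.choice([-1.0, 1.0], size=1000)
    next_states = states + 0.5 * actions + rng.normal(scale=0.1, size=1000)

    # Phase 1: world model as a least-squares fit of s' from (s, a).
    X = np.stack([states, actions], axis=1)
    w, *_ = np.linalg.lstsq(X, next_states, rcond=None)

    # Phase 2: roll a policy out in the *imagined* world only.
    def rollout(policy, s=0.0, steps=20):
        total = 0.0
        for _ in range(steps):
            a = policy(s)
            s = w[0] * s + w[1] * a   # imagined transition, no real environment
            total += -abs(s)          # reward: stay near 0
        return total

    def greedy(s):
        # Push the state back toward 0.
        return 1.0 if s == 0 else -float(np.sign(s))

    print("imagined return:", rollout(greedy))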


> Now LLMs use that pretty well (even though stuffing everything into an LLM is getting old, and, as this article points out, in any specific application they tend to get bested by very simple models like XGBoost).

You can't blame the users for that, though. For instance, OpenAI's ChatGPT uses 'Ask Anything' as its home-page prompt. Zero specialization, expert at anything. And people totally believe it.


I’ve got no problem w/ synthetic data, but it is still more work than most people want to do.


There was a post on here recently about how you should build your own agent, and I completely agree. I'd say most competent developers should be building even more complex projects than an agent. Once you do, you realize it's a constant uphill battle, and it quickly becomes apparent that the data you're working with is the primary issue.
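
For anyone on the fence: the skeleton really is small. A sketch where call_llm is a placeholder for whatever chat API you use, and the single tool is a toy:

    # Bare-bones agent loop. call_llm wraps your model API of choice;
    # the model is expected to reply with JSON naming a tool or an answer.
    import json

    def call_llm(messages: list[dict]) -> str:
        raise NotImplementedError("wrap your model API here")

    def run_tool(name: str, args: dict) -> str:
        if name == "add":
            return str(args["a"] + args["b"])
        return f"unknown tool: {name}"

    def agent(task: str, max_steps: int = 5) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            messages.append({"role": "assistant", "content": reply})
            try:
                call = json.loads(reply)  # {"tool": ..., "args": ...} or {"answer": ...}
            except json.JSONDecodeError:
                return reply  # model answered in plain text
            if "answer" in call:
                return call["answer"]
            result = run_tool(call["tool"], call.get("args", {}))
            messages.append({"role": "user", "content": f"tool result: {result}"})
        return "step limit reached"

The loop is the easy part. The real work, as this thread says, is everything around it: the data the tools touch, and evaluating whether the outputs are any good.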


I don't know if that's what the GP and the comments above are talking about. "Agents" are the kind of thing/word that helps paper over the fact that these things only work because of a huge amount of humans in the loop at the outset (that is, you know, labor). Agents help us believe that LLMs can do everything for us, even bootstrap themselves, but what the thread above is about is that, really, what you get out correlates only with what you put in in the first place.


> Agents help us believe that LLMs can do everything for us, even bootstrap themselves

Having the agent, and treating it carelessly, helps one believe this.

Making it is another story.



