Thank you for the detailed and easy-to-understand explanations. This is why I come to HN.
As someone who used to work in an adjacent field, I have a question that I hope you can help me with. DL usually requires a large volume of data, but I don't know whether there is a huge pile of experimentally determined protein structures to train on. Did DeepMind find a clever way to get around the data-size issue, or is there really a large number of known protein structures? I skimmed one of their earlier papers and it seems the training data was in the tens of thousands. I am surprised that is enough to train a DL structure-prediction model.
Also, I am not in ML research, but IIRC the spooky/counterintuitive thing about high-dimensional gradient-based search is that you can get better results from less data by increasing the number of parameters, as long as you have a sane set of regularization techniques (or have we even ditched those too?).
> as long as you have a sane set of regularization techniques (or have we even ditched those too?).
Maybe. In the context of natural language at least, Transformers require less and less data to reach the same result as you increase the number of parameters. No regularization needed. See Figure 2 in the paper Scaling Laws for Neural Language Models (2001.08361).
It's quite odd. Who knows if that will hold for other domains, like protein folding. It may very well be the case though, since AFAIK DeepMind's folding model used attention to reach these landmark results.
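For a rough sense of what those curves say, here's a back-of-envelope sketch of the L(N, D) fit from that paper. I'm quoting the exponents and constants from memory, so treat the numbers as illustrative, not exact:

    # Sketch of the L(N, D) fit from the Scaling Laws paper (their eq. 1.5);
    # exponents/constants quoted from memory, so only indicative.
    ALPHA_N, ALPHA_D = 0.076, 0.095
    N_C, D_C = 8.8e13, 5.4e13  # fitted constants (parameters, tokens)

    def loss(n_params, n_tokens):
        """Predicted test loss for a model with n_params trained on n_tokens."""
        return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

    def tokens_to_reach(n_params, target_loss):
        """Tokens a model of n_params needs to hit target_loss (inf if it never can)."""
        gap = target_loss ** (1.0 / ALPHA_D) - (N_C / n_params) ** (ALPHA_N / ALPHA_D)
        return D_C / gap if gap > 0 else float("inf")

    for n in (1e8, 1e9, 1e10):
        print(f"{n:.0e} params -> ~{tokens_to_reach(n, 3.0):.1e} tokens to reach loss 3.0")

Plugging in a few parameter counts shows the token budget for a fixed target loss shrinking as N grows, which is the "bigger models need less data" reading of Figure 2 (with the caveat that it saturates once the data term dominates).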
In Figure 2, those curves are all from one epoch each, right? Which is to say the model has never seen the same data twice?
It's true you don't need regularization if you've never seen the same data twice, but that is itself a form of regularization, similar to early stopping. You'd also expect the larger number of parameters to make the training error drop faster with fewer tokens, due to improved optimizability. So rather than "larger models need less data", I'd say the takeaway is more "larger models need fewer steps to optimize the training error". None of the models get "good" until they've seen a number of tokens similar to the number of parameters.
Unless you've run many epochs on small data sets and seen the same results, in which case that's pretty weird/cool.
Yeah, Figure 4 is clearer. This is the early-stopped loss though, so the regularization is more explicit. If you trained the large models to completion on small data sets they would do much worse on test error due to overfitting.
It is interesting that larger models with regularization (early stopping) seem to work better than smaller models trained to convergence, though.
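If it helps to see the mechanism, here's a toy sketch of early stopping acting as regularization: an overparameterized linear model fit by gradient descent, halted when the held-out loss stops improving. The data and hyperparameters are made up and it's nothing like the paper's setup, it's just the mechanism:

    import numpy as np

    # Toy early stopping: overparameterized linear regression (200 weights, 40
    # training points) fit by gradient descent, stopped when the validation
    # loss stops improving. Purely illustrative.
    rng = np.random.default_rng(0)
    n_train, n_val, n_feat = 40, 40, 200
    w_true = rng.normal(size=n_feat) * (rng.random(n_feat) < 0.05)  # sparse ground truth
    X_tr = rng.normal(size=(n_train, n_feat))
    X_va = rng.normal(size=(n_val, n_feat))
    y_tr = X_tr @ w_true + 0.1 * rng.normal(size=n_train)
    y_va = X_va @ w_true + 0.1 * rng.normal(size=n_val)

    w = np.zeros(n_feat)
    lr, patience = 1e-2, 20
    best_val, steps_since_best = np.inf, 0
    for step in range(10_000):
        w -= lr * X_tr.T @ (X_tr @ w - y_tr) / n_train  # gradient step on train MSE
        val = np.mean((X_va @ w - y_va) ** 2)
        if val < best_val - 1e-6:
            best_val, steps_since_best = val, 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:  # held-out loss has stopped improving
                break
    print(f"stopped at step {step}, best validation MSE {best_val:.3f}")

The training error keeps improving while the held-out loss plateaus (or turns back up), and stopping at that point is the implicit regularization the early-stopped curves rely on.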
My understanding, based only on having watched Yannic Kilcher's video, is that they had on the order of 10k training cases. So not a lot.
But they added a ton of highly domain-specific features. And since it's essentially a natural phenomenon, I would expect the signal to be relatively strong.
Be careful about who you trust. The author of the article is a professor of structural biology (the field that studies protein structure) at Imperial College London. The article is well written and the criticisms are fair and carefully stated. Don't come to HN for scientific opinions; this site is pretty mediocre at that. There are exceptions, of course.
Sure, but also keep in mind that a professor of structural biology is not necessarily a consumer of structural biology (versus me; plus I have citations proving that I have consumed them) ;)
The AI that I currently help orchestrate is not in the sciences. I do plan to go back into biotech someday, but not in anything that requires this in a major way (the intended disruption is socioeconomic, not technological).