Thank you for the detailed and easy-to-understand explanations. This is why I come to HN.
As someone who used to work in an adjacent field, I have a question that I hope you can help me with. DL usually requires a large volume of data, but I don't know whether there is a huge pile of experimentally determined protein structures to train on. Did DeepMind find a clever way to get around the data-size issue, or is there really a large number of known protein structures? I skimmed one of their earlier papers and it seems the training data was in the tens of thousands. I am surprised that is enough to train a DL structure-prediction model.
Also, I am not in ML research, but IIRC the spooky/counterintuitive thing about high-dimensional gradient-based search is that you can get better results from less data by increasing the number of parameters, as long as you have a sane set of regularization techniques (or have we even ditched those too?).
> as long as you have a sane set of regularization techniques (or have we even ditched those too?).
Maybe. In the context of natural language at least, Transformers require less and less data to reach the same result as you increase the number of parameters. No regularization needed. See Figure 2 in the paper Scaling Laws for Neural Language Models (2001.08361).
It's quite odd. Who knows if that will hold for other domains, like protein folding. It may very well be the case though, since AFAIK DeepMind's folding model used attention to reach these landmark results.
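For a rough sense of what those curves say, here's a back-of-envelope sketch of the L(N, D) fit from that paper. I'm quoting the exponents and constants from memory, so treat the numbers as illustrative, not exact:

    # Sketch of the L(N, D) fit from the Scaling Laws paper (their eq. 1.5);
    # exponents/constants quoted from memory, so only indicative.
    ALPHA_N, ALPHA_D = 0.076, 0.095
    N_C, D_C = 8.8e13, 5.4e13  # fitted constants (parameters, tokens)

    def loss(n_params, n_tokens):
        """Predicted test loss for a model with n_params trained on n_tokens."""
        return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

    def tokens_to_reach(n_params, target_loss):
        """Tokens a model of n_params needs to hit target_loss (inf if it never can)."""
        gap = target_loss ** (1.0 / ALPHA_D) - (N_C / n_params) ** (ALPHA_N / ALPHA_D)
        return D_C / gap if gap > 0 else float("inf")

    for n in (1e8, 1e9, 1e10):
        print(f"{n:.0e} params -> ~{tokens_to_reach(n, 3.0):.1e} tokens to reach loss 3.0")

Plugging in a few parameter counts shows the token budget for a fixed target loss shrinking as N grows, which is the "bigger models need less data" reading of Figure 2 (with the caveat that it saturates once the data term dominates).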
In Figure 2, those curves are all from one epoch each, right? Which is to say the model has never seen the same data twice?
It's true you don't need regularization if you've never seen the same data twice, but that is itself a form of regularization, similar to early stopping. You'd also expect the larger number of parameters to make the training error drop faster with fewer tokens, due to improved optimizability. So rather than "larger models need less data", I'd say the takeaway is more "larger models need fewer steps to optimize the training error". None of the models get "good" until they've seen a number of tokens similar to the number of parameters.
Unless you've run many epochs on small data sets and seen the same results, in which case that's pretty weird/cool.
Yeah, Figure 4 is clearer. This is the early-stopped loss though, so the regularization is more explicit. If you trained the large models to completion on small data sets they would do much worse on test error due to overfitting.
It is interesting that larger models with regularization (early stopping) seem to work better than smaller models trained to convergence, though.
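If it helps to see the mechanism, here's a toy sketch of early stopping acting as regularization: an overparameterized linear model fit by gradient descent, halted when the held-out loss stops improving. The data and hyperparameters are made up and it's nothing like the paper's setup, it's just the mechanism:

    import numpy as np

    # Toy early stopping: overparameterized linear regression (200 weights, 40
    # training points) fit by gradient descent, stopped when the validation
    # loss stops improving. Purely illustrative.
    rng = np.random.default_rng(0)
    n_train, n_val, n_feat = 40, 40, 200
    w_true = rng.normal(size=n_feat) * (rng.random(n_feat) < 0.05)  # sparse ground truth
    X_tr = rng.normal(size=(n_train, n_feat))
    X_va = rng.normal(size=(n_val, n_feat))
    y_tr = X_tr @ w_true + 0.1 * rng.normal(size=n_train)
    y_va = X_va @ w_true + 0.1 * rng.normal(size=n_val)

    w = np.zeros(n_feat)
    lr, patience = 1e-2, 20
    best_val, steps_since_best = np.inf, 0
    for step in range(10_000):
        w -= lr * X_tr.T @ (X_tr @ w - y_tr) / n_train  # gradient step on train MSE
        val = np.mean((X_va @ w - y_va) ** 2)
        if val < best_val - 1e-6:
            best_val, steps_since_best = val, 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:  # held-out loss has stopped improving
                break
    print(f"stopped at step {step}, best validation MSE {best_val:.3f}")

The training error keeps improving while the held-out loss plateaus (or turns back up), and stopping at that point is the implicit regularization the early-stopped curves rely on.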
My understanding, based only on having watched Yannic Kilcher's video, is that they had on the order of 10k training cases. So not a lot.
But they added a ton of highly domain-specific features. And since it's essentially a natural phenomenon, I would expect the signal to be relatively strong.
Be careful about who you trust. The author of the article is a professor of structural biology (the field that studies protein structure) at Imperial College London. The article is well written and the criticisms are fair and carefully stated. Don't come to HN for scientific opinions; this site is pretty mediocre at that. There are exceptions, of course.
Sure, but also keep in mind that a professor of structural biology is not necessarily a consumer of structural biology (versus me; plus I have citations proving that I have consumed them) ;)
The AI that I currently help orchestrate is not in the sciences. I do plan to go back into biotech someday, but not in anything that requires this in a major way (the intended disruption is socioeconomic, not technological).