Very cool idea. Interested to see how this progresses.
One question: how worried are you about over-fitting to this particular dataset, i.e. leaning toward memorization instead of generalization? Obviously you hold out a validation set, but since you're meta-optimizing the model by its performance on that validation set, you're still at risk of over-fitting to it.
yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts only tiny bits of information per step. but over time, we will switch the validation set to some other random subset of FineWeb, or even to entirely OOD datasets!
The question is not if but when. I hope the project authors acknowledge the problem directly: given enough time, it's not merely a risk but a statistical certainty. So, what's the plan?
At the very least, track it. How will the project maintainers instrument this?
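One cheap way to track it: periodically evaluate on a freshly sampled holdout that the meta-optimization has never seen, and log the gap against the fixed meta-validation loss. A growing gap is the overfitting signal. This is a minimal sketch, not the project's actual code; the `loss_fn` API and all names here are hypothetical.

```python
# Hypothetical sketch of the instrumentation discussed above.
# Assumes a loss_fn(model, batch) interface; all names are illustrative,
# not taken from the project's codebase.

def eval_loss(loss_fn, model, batches):
    """Average loss over a list of batches."""
    return sum(loss_fn(model, b) for b in batches) / len(batches)

def overfit_gap(loss_fn, model, meta_val_batches, fresh_batches):
    """Difference between loss on a freshly sampled holdout and loss on
    the fixed meta-validation set. A growing positive gap indicates the
    meta-optimization has started fitting the validation set itself."""
    meta_loss = eval_loss(loss_fn, model, meta_val_batches)
    fresh_loss = eval_loss(loss_fn, model, fresh_batches)
    return fresh_loss - meta_loss

# Toy demo: a "model" that has perfectly memorized the meta-val batches.
meta_val = [1.0, 2.0, 3.0]
fresh = [1.5, 2.5, 3.5]
memorized = set(meta_val)
loss_fn = lambda model, b: 0.0 if b in model else 1.0

print(overfit_gap(loss_fn, memorized, meta_val, fresh))  # → 1.0
```

Logging this gap each time the meta-optimizer updates would make the drift visible long before it dominates, and rotating `fresh_batches` (as suggested above) keeps the signal honest.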