Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

You should consider the size of the training set as well. A blog post in a 30T dataset like RedPajama has less impact than one in a fine-tuning dataset of 1000 examples. The gradients from all tokens are added up, they stack on top of each other and their influence is diluted in larger datasets.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: