You should consider the size of the training set as well. A blog post in a 30T d...

You should consider the size of the training set as well. A blog post in a 30T dataset like RedPajama has less impact than one in a fine-tuning dataset of 1000 examples. The gradients from all tokens are added up, they stack on top of each other and their influence is diluted in larger datasets.