
If true, this is a very nice incremental improvement. It doesn't look like it meaningfully improves the capabilities of the model, but it is cheaper to compute than RMSNorm (which essentially all current state-of-the-art LLMs use), which means faster and cheaper training.
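For anyone who hasn't read the paper, here is a rough sketch of the two layers in plain Python on a single token vector. It assumes DyT has the form `weight * tanh(alpha * x) + bias` with a learnable scalar `alpha`; the `alpha=0.5` default and the function names are my own assumptions for illustration, not the paper's code.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Reduction: the RMS needs every element of x before any output exists.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def dyt(x, weight, bias, alpha=0.5):
    # Elementwise: each output depends only on its own input element.
    return [w * math.tanh(alpha * v) + b for v, w, b in zip(x, weight, bias)]

x = [0.5, -1.0, 2.0, 0.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

print(rms_norm(x, ones))
print(dyt(x, ones, zeros))

# Changing one input element shifts every RMSNorm output,
# but only the matching DyT output.
x2 = [5.0] + x[1:]
print(rms_norm(x2, ones)[1] != rms_norm(x, ones)[1])
print(dyt(x2, ones, zeros)[1:] == dyt(x, ones, zeros)[1:])
```

The difference that matters for cost is the `sum(...)` in `rms_norm`: it is a reduction over the feature dimension, while `dyt` has no cross-element dependency at all.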


RMSNorm is pretty insignificant in terms of the overall compute in a transformer, though; the reduction work can usually be fused with earlier or later operations.


RMSNorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.

When you split a network across multiple GPUs, this means you must wait for the slowest node and the longest latency.

As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).


What are other barriers in transformers? Or is the normalization layer the primary one?


Dot-product attention is the biggest barrier. This is why there are so many attempts to linearize it.


...that fail. Linearization is a bad idea. But plenty of other optimizations are being done.


The paper's Table 7 shows DyT reducing overall LLaMA 7B inference time by 7.8% and training time by 8.2%. That is not insignificant.


But LLM performance scales according to the log of compute, so yeah it’s pretty insignificant. I think we’ve reached a bit of a plateau.
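To put a rough number on that, here is a back-of-the-envelope calculation. It assumes the 8.2% training-time saving quoted above is reinvested as extra compute on the same budget, and that gains are proportional to the log of compute as claimed; both are simplifying assumptions, not measurements.

```python
import math

# Training-time reduction quoted above from the paper's Table 7.
time_saved = 0.082

# Assumption: the saved time is spent on more training,
# so the same budget buys this much more compute.
compute_multiplier = 1.0 / (1.0 - time_saved)

# Under log-of-compute scaling, gains track doublings of compute.
doublings = math.log2(compute_multiplier)

print(f"{compute_multiplier:.3f}x compute, {doublings:.3f} doublings")
```

Under those assumptions the saving is worth about 1.09x compute, or roughly an eighth of a doubling.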


Okay, I just tried this on my pet transformer training benchmark and the results are very disappointing; it converges much more slowly than just using RMSNorm.

It either needs some significant hyperparameter tuning (besides tweaking alpha, which doesn't seem to do much for me), or some fancier initialization (I tried both the PyTorch default and orthogonal, with no difference), or maybe my scalar optimizer doesn't work on it (I have a custom optimizer for scalars which speeds up convergence vs Adam, but for DyT layers it seems to be no better than Adam), or maybe it only catches up after billions of tokens (which I don't have the budget to test).


Slight update: a fancier initialization of the DyT weights (instead of having them be ones) seems to help a lot in my case (although it's still not as good as just using RMSNorm). Do something like this on the very first training step (`x` is the input to the layer):

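    # Reference: what RMSNorm (without a learned gain) would output for x
    y = x.to(torch.float32)
    y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6)
    # DyT activation before the per-channel weight is applied
    z = torch.tanh(self.alpha * x)
    # Per-channel ratio of the two, averaged over the second-to-last dim
    scale = (y / (z + 1e-6)).mean(dim=-2).flatten()
    self.weight.detach().copy_(scale)

This basically tries to initialize the weights so that the output of DyT is closer to what RMSNorm would have produced, and it seems to help.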


Which model are you training and on what dataset?


It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written fully from scratch and further tweaked/modified. I use it for experiments and as a testbed when developing my training harness (which I use for training other models too, and which receives all of my non-LLM-specific improvements, e.g. better-than-Adam optimizers, a custom GPU memory allocator, custom gradient accumulation that accumulates directly into the optimizer's state without using extra VRAM for gradients, etc.).

For the dataset I just use FineWeb-Edu.


Wow, thank you for the link to the code - I haven't seen it before - it contains a ton of useful tricks. Lots to learn from there.



