If true, this is a very nice incremental improvement. It looks like it doesn't mean...

rryan · on March 15, 2025

RMSNorm is pretty insignificant in terms of the overall compute in a transformer though -- usually the reduction work can be fused with earlier or later operations.

londons_explore · on March 15, 2025

RMSNorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.

When a network is split across multiple GPUs, this means you must wait for the slowest node and the longest latency.

As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
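One way to read the barrier claim, sketched under the assumption that the feature dimension is sharded across two devices: RMSNorm needs the mean of squares over the whole vector, so no shard can finish on its own, while a tanh-based replacement is elementwise. The helper names `rms_norm` and `dyt` are hypothetical, and `dyt` here omits the learned weight and bias for brevity.

```python
import math

def rms_norm(x, eps=1e-6):
    # Needs the mean of squares over the WHOLE vector: a reduction.
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv for v in x]

def dyt(x, alpha=0.5):
    # Purely elementwise: each output depends on one input only.
    return [math.tanh(alpha * v) for v in x]

x = [0.5, -1.0, 2.0, 4.0]
left, right = x[:2], x[2:]   # pretend each half lives on its own device

dyt_sharded = dyt(left) + dyt(right)            # no communication needed
rms_sharded = rms_norm(left) + rms_norm(right)  # wrong without a global sum

print(dyt_sharded == dyt(x))        # True
print(rms_sharded == rms_norm(x))   # False
```

To get the right RMSNorm result, the shards would first have to exchange their partial sums of squares, which is the synchronization point being discussed.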

elcritch · on March 15, 2025

What are other barriers in transformers? Or is the normalization layer the primary one?

woadwarrior01 · on March 15, 2025

dot-product attention is the biggest barrier. This is why there are so many attempts to linearize it.

amitport · on March 15, 2025

that fail... linearization is a bad idea. But plenty of other optimizations are done

atgctg · on March 15, 2025

The paper's Table 7 shows DyT reducing overall LLaMA 7B inference time by 7.8% and training time by 8.2%. That is not insignificant.

Herring · on March 15, 2025

But LLM performance scales according to the log of compute, so yeah it’s pretty insignificant. I think we’ve reached a bit of a plateau.
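A back-of-the-envelope version of this argument, assuming the 8.2% training-time saving quoted above converts directly into extra compute at a fixed budget, and that quality tracks the log of compute (both are assumptions for illustration, not results from the paper):

```python
import math

time_saved = 0.082                         # training-time reduction quoted above
compute_gain = 1.0 / (1.0 - time_saved)    # extra compute in the same wall-clock time
doublings = math.log2(compute_gain)        # progress measured in compute doublings

print(f"{compute_gain:.3f}x compute, {doublings:.3f} doublings")
```

Under those assumptions the saving is worth roughly an eighth of one compute doubling.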

kouteiheika · on March 15, 2025

Okay, I just tried this on my pet transformer training benchmark and the results are very disappointing; it converges much more slowly than just using RMSNorm.

It either needs some significant hyperparameter tuning (besides tweaking alpha, which doesn't seem to do much for me), or some fancier initialization (I tried both the PyTorch default and orthogonal, no difference), or maybe my scalar optimizer doesn't work on it (I have a custom optimizer for scalars which speeds up convergence vs Adam, but for DyT layers it seems to be only as good as Adam), or maybe it only catches up after billions of tokens (which I don't have the budget to test).

kouteiheika · on March 16, 2025

Slight update: fancier initialization of the DyT weights (instead of setting them to ones) seems to help a lot in my case (although it's still not as good as just using RMSNorm). Do something like this on the very first training step (`x` is the input to the layer):

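    # y: what RMSNorm (without a learned gain) would output for x
    y = x.to(torch.float32)
    y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6)
    # z: what DyT outputs before its per-feature weight is applied
    z = torch.tanh(self.alpha * x)
    # per-feature ratio between the two, averaged over dim -2 (the token dimension)
    scale = (y / (z + 1e-6)).mean(dim = -2).flatten()
    # overwrite the DyT weight in place, outside of autograd
    self.weight.detach().copy_(scale)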

This basically tries to initialize the weights so that the output of DyT is closer to what RMSNorm would have produced, and it seems to help.
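For anyone who wants to try this outside a full training loop, here is a self-contained sketch. It assumes DyT has the form `weight * tanh(alpha * x) + bias` with a scalar `alpha` initialized to 0.5; the class layout, the `init_from_rmsnorm` method name, the `rms_norm` helper, and the random input are all hypothetical. Unlike the snippet above, it flattens every leading dimension into a single token dimension before averaging.

```python
import torch
import torch.nn as nn


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reference RMSNorm without a learned gain.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


class DyT(nn.Module):
    # Assumed form: weight * tanh(alpha * x) + bias, with a scalar alpha.
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

    @torch.no_grad()
    def init_from_rmsnorm(self, x: torch.Tensor, eps: float = 1e-6) -> None:
        # Flatten all leading dimensions into one token dimension.
        x = x.to(torch.float32).reshape(-1, x.shape[-1])
        y = rms_norm(x, eps)
        z = torch.tanh(self.alpha * x)
        # Per-feature average of the ratio RMSNorm(x) / tanh(alpha * x).
        self.weight.copy_((y / (z + eps)).mean(dim=0))


torch.manual_seed(0)
x = torch.randn(128, 64)   # hypothetical first batch: 128 tokens, 64 features
layer = DyT(64)
target = rms_norm(x)

err_ones = (layer(x) - target).pow(2).mean().item()   # weights still at ones
layer.init_from_rmsnorm(x)
err_init = (layer(x) - target).pow(2).mean().item()   # after the data-dependent init
print(err_ones, err_init)
```

On unit-variance random input the second error should come out smaller than the first, which is the effect described above; whether that carries over to real activations is a separate question.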

kadushka · on March 15, 2025

Which model are you training and on what dataset?

kouteiheika · on March 15, 2025

It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written from scratch and further tweaked. I use it for experiments and as a testbed when developing my training harness (which I also use for training other models, and which receives all of my non-LLM-specific improvements, e.g. better-than-Adam optimizers, a custom GPU memory allocator, custom gradient accumulation that accumulates directly into the optimizers' state without using extra VRAM for gradients, etc.).

For the dataset I just use FineWeb-Edu.

kadushka · on March 16, 2025

Wow, thank you for the link to the code - I haven't seen it before - it contains a ton of useful tricks. Lots to learn from there.