If true this is very nice incremental improvement. It looks like it doesn't meaningfully improve the capabilities of the model, but is cheaper to compute than RMSNorm (which essentially all current state of art LLMs use) which means faster/cheaper training.
RMSNorm is pretty insigificant in terms of the overall compute in a transformer though -- usually the reduction work can be fused with earlier or later operations.
Rmsnorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.
Splitting networks across multiple GPU's, this means you must wait for the slowest node and the longest latency.
As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (ie. Mixing different GPU models).
Okay, I just tried this on my pet transformer training benchmark and the results are very disappointing; it converges much more slowly than just using RMSNorm.
It either needs some significant hyperparameter tuning (besides tweaking alpha, which doesn't seem to do much for me), or some fancier initialization (tried both pytorch default and orthogonal, no difference), or maybe my scalar optimizer doesn't work on it (I have a custom optimizer for scalars which speeds up convergence vs Adam, but for DyT layers it seems to be just as good as Adam), or maybe it only catches up after billions of tokens (which I don't have the budget to test for so long).
Slight update, more fancy initialization of DyT weights (instead of having them be ones) seems to help a lot in my case (although it's still not as good as just using RMSNorm). Do something like this on the very first training step (`x` is the input to the layer):
y = x.to(torch.float32)
y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + 1e-6)
z = torch.tanh(self.alpha * x)
scale = (y / (z + 1e-6)).mean(dim = -2).flatten()
self.weight.detach().copy_(scale)
This basically tries to initialize the weights so that the output of DyT is closer to what RMSNorm would have outputted, and it seems to help.
It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written fully from scratch and further tweaked/modified. I use it for experiments and as a testbed when developing my training harness (which I use for training other models too, and which receives all of my non-LLM-specific improvements like e.g. better than Adam optimizers, a custom GPU memory allocator, custom gradient accumulation that accumulates directly into the optimizers' state without using extra VRAM for gradient, etc.).