Hierarchical transformers are more efficient language models (arxiv.org)
91 points by beefman on Nov 4, 2021 | 5 comments


This work is quite interesting, and I'm always happy to see such large improvements in memory usage and computation over existing high-performing models. I imagine transformer-based models will be with the ML community for a very long time (like ResNets).
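To make the efficiency gain concrete, here is a minimal NumPy sketch of the general shorten-attend-upsample pattern. The average-pooling and repeat-based upsampling below are illustrative assumptions, not the paper's actual mechanism; the point is just that pooling tokens by a factor of k shrinks the attention score matrix by k^2.

    import numpy as np

    def attention(x):                            # x: (seq_len, d)
        scores = x @ x.T / np.sqrt(x.shape[-1])  # (seq_len, seq_len) score matrix
        w = np.exp(scores - scores.max(-1, keepdims=True))
        return (w / w.sum(-1, keepdims=True)) @ x

    def pool(x, k=2):                            # average-pool groups of k tokens (assumed scheme)
        return x.reshape(-1, k, x.shape[-1]).mean(axis=1)

    def upsample(x, k=2):                        # repeat each pooled token k times (assumed scheme)
        return np.repeat(x, k, axis=0)

    x = np.random.randn(1024, 64)
    full = attention(x)                          # builds a 1024 x 1024 score matrix
    short = upsample(attention(pool(x)))         # builds a 512 x 512 matrix, ~4x cheaper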

I will say that it is somewhat comical that an existing paper titled "Hierarchical Transformers for Long Document Classification" isn't mentioned in the related work. But to be fair, the two papers are only modestly similar.


I agree that Transformers are here to stay. The basic building block (the self-attention layer) seems to me like the "new fully-connected layer" - the natural way to connect layers and build a deep net. With FC layers, the activations can only be fixed-size vectors; with self-attention, each layer can operate on a variable-sized bag of vectors, and you just need to encode their relationship to each other somehow. This has clearly been successful for text using sinusoidal positional encodings. It's starting to work for images, with 2D positional encodings. There's every reason to think it will work for many other data types.
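A toy example of that point (details assumed, in NumPy): a single set of projection weights handles inputs of any length, unlike a fully-connected layer, and order only enters through the positional encodings added to the inputs.

    import numpy as np

    d = 64
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def positional_encoding(n, d):               # sinusoidal, sin/cos halves concatenated
        pos = np.arange(n)[:, None]
        freq = 10000.0 ** (-np.arange(d // 2) * 2 / d)
        return np.concatenate([np.sin(pos * freq), np.cos(pos * freq)], axis=-1)

    def self_attention(x):                       # x: (n, d) for any n
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        return (w / w.sum(-1, keepdims=True)) @ v

    for n in (7, 100, 512):                      # same layer, three different lengths
        x = rng.standard_normal((n, d)) + positional_encoding(n, d)
        print(self_attention(x).shape)           # (n, d) each time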

It seems to me the key barrier is the high computational overhead of self-attention, which scales quadratically with sequence length. But in the highly-parallel vector-math world (GPUs, TPUs, NPUs, etc.) Moore's law marches on, with little end in sight, because parallelism works great. That said, making them more efficient, as this paper does, will certainly help their adoption.
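As a rough illustration of that overhead (illustrative numbers: a single head, fp32): the score matrix alone has n^2 entries, so doubling the context length quadruples its memory and the FLOPs needed to build it.

    for n in (1024, 2048, 4096, 8192):
        mib = n * n * 4 / 2**20                  # n^2 fp32 entries in one attention matrix
        print(f"{n:5d} tokens -> {mib:6.0f} MiB per attention matrix")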


They work for everything.


The models in the two papers are pretty different. If I were a reviewer I wouldn't expect TFA to cite that paper.


In their defense, this field of study moves incredibly fast: there are hundreds of papers published every week. It would be a full-time job simply to keep up to date with the literature, to say nothing of doing your own research, so it's hardly surprising that the authors might not mention a given paper.



