
I agree that Transformers are here to stay. The basic building block, the self-attention layer, seems to me like the "new fully-connected layer": the natural way to connect layers and build a deep net. Except that with FC layers the activations can only be fixed-size vectors, whereas with self-attention each layer can operate on a variable-sized bag of vectors; you just need to encode their relationship to each other somehow. This is clearly successful for text, using sinusoidal positional encodings. It's starting to work for images, with 2D positional encodings. There's every reason to think it will work for many other data types.
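A minimal sketch of what I mean, assuming NumPy, a single head, and the standard sinusoidal encoding (the function names here are mine): the same weights handle inputs of any length, because attention is defined over whatever bag of vectors you give it.

    # Minimal sketch: single-head scaled dot-product self-attention over a
    # variable-length set of n vectors, with sinusoidal positional encodings
    # added so the layer knows how the vectors relate to each other.
    import numpy as np

    def sinusoidal_positions(n, d):
        """Standard sin/cos positional encodings for n positions of width d."""
        pos = np.arange(n)[:, None]                    # (n, 1)
        i = np.arange(d)[None, :]                      # (1, d)
        angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
        return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (n, d)

    def self_attention(x, Wq, Wk, Wv):
        """x: (n, d) bag of vectors; n can differ per input, d is fixed."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])        # (n, n): every vector attends to every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True) # softmax over the n keys
        return weights @ v                             # (n, d)

    d = 64
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
    for n in (5, 17, 120):                             # variable-sized inputs, same weights
        x = rng.normal(size=(n, d)) + sinusoidal_positions(n, d)
        print(n, self_attention(x, Wq, Wk, Wv).shape)  # (n, 64) each time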

It seems to me the key barrier is the high computational overhead of self-attention. But in the highly parallel world of vector math (GPUs, TPUs, NPUs, etc.), Moore's law marches on with little end in sight, because parallelism works great. That said, making self-attention more efficient, as this paper does, will certainly help adoption.
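For a rough sense of that overhead (my own back-of-envelope numbers, not from the paper): the n x n score matrix means attention cost grows quadratically with sequence length, which is exactly what the efficiency work targets.

    # Rough back-of-envelope: the n x n attention matrix makes cost grow
    # quadratically in sequence length n (assumed head count and width).
    d, heads = 64, 12
    for n in (512, 1024, 2048, 4096):
        score_flops = 2 * heads * n * n * d    # q @ k.T per head
        score_mem = heads * n * n * 4          # float32 score matrices, in bytes
        print(f"n={n:5d}  ~{score_flops/1e9:6.1f} GFLOPs  ~{score_mem/2**20:7.1f} MiB of scores")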



They work for everything.




