Funnily enough, I'm currently trying to make my way through a preprint showing that models of dense associative memory with bipartite structure, including Transformers (!!!), are a special case of a more general routing algorithm that implements a "block of agents" in a differentiable model of Minsky's Society of Mind: https://arxiv.org/abs/2211.11754. Maybe "symbolic" and "connectionist" AI are two sides of the same coin?
EDIT: I feel compelled to mention that the efficient implementation of that more general routing algorithm can handle input sequences with more than 1M token embeddings on a single GPU, which quite frankly seems like it should be impossible, but somehow it works: https://github.com/glassroom/heinsen_routing#routing-very-lo....
For those reading this after the fact: this thread started as a response to eigenvalue's original headline, a question, which was subsequently moved to a separate comment: https://news.ycombinator.com/item?id=34110868
Oh I see, thanks! Interesting. It sounds like this is some kind of dynamic attention, as opposed to the static attention in Transformers, where the queries and keys don't change during the calculation of their similarity. His routing algorithm computes the similarity iteratively.
OK, here's a preliminary, hand-wavy, high-level summary of my understanding so far:
It seems we could repeatedly apply attention as an iterative update rule, feeding the output sequence as the new queries: new_Q = Attention(old_Q, K, V). Obviously, no one ever does that. But in theory we could, and apparently it would work just as well.
In self-attention, we apply dense layers to the input sequence to get K and V, so we can rewrite the update rule like this: new_Q = Attention(old_Q, input_seq). Internally, Attention would have to compute and cache K and V from input_seq. The input sequence is given and remains fixed, so we can rewrite the update rule again to make that explicit, like this: new_Q = Attention(old_Q | input_seq).
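To make that concrete, here's a minimal PyTorch sketch of the update-rule view as I understand it (single attention head, no masking; names like attention_update, W_K, W_V are mine, not the preprint's or the repo's):

    # Minimal sketch of attention as an iterative update rule.
    # All names here are mine, invented for illustration.
    import torch
    import torch.nn.functional as F

    n, m, d = 128, 16, 64              # input length, output length, embedding size
    W_K = torch.nn.Linear(d, d)        # dense layers that produce K and V
    W_V = torch.nn.Linear(d, d)

    def attention_update(old_Q, input_seq):
        # One step of new_Q = Attention(old_Q | input_seq).
        # K and V are recomputed from the fixed input sequence;
        # only the queries change from one iteration to the next.
        K, V = W_K(input_seq), W_V(input_seq)            # [n, d], [n, d]
        scores = old_Q @ K.transpose(-2, -1) / d**0.5    # [m, n]
        return F.softmax(scores, dim=-1) @ V             # [m, d]

    input_seq = torch.randn(n, d)      # the input sequence stays fixed
    Q = torch.randn(m, d)              # initial queries

    for _ in range(5):                 # iterate the update rule; in practice
        Q = attention_update(Q, input_seq)  # we stop after a single update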
According to the preprint, everything the routing algorithm does can be written as an update rule U that looks just like that, except that the preprint calls the queries "x_out" and the input sequence "x_inp". Using their notation, the update rule looks like this (section 3.3 and appendix B): x_out <-- U(x_out | x_inp).
One more thing that seems important: In self-attention, we get the initial queries by applying a dense layer to the input sequence. In the routing algorithm, the initial queries come instead from the same update rule U, by assuming all possible output states have equal probability. Interestingly, the code in the GitHub repo executes only two updates by default: one to compute the initial queries, and another to get the updated queries that are returned as the output sequence. That isn't too different from what we normally do with attention: compute the initial queries, then apply one update and return the result as the output sequence.
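Reusing the names from the sketch above, the default two-update pattern would look roughly like this (again my own hand-wavy approximation, not the repo's actual code; in the real routing algorithm the output capsules are differentiated by their own parameters, which I'm omitting here):

    # "Update" 1: initial queries under the equal-probability assumption,
    # i.e. every input element weighted equally, which for plain attention
    # reduces to the unweighted mean of V.
    Q0 = W_V(input_seq).mean(dim=0, keepdim=True).expand(m, d)   # [m, d]

    # Update 2: one ordinary attention update; the result is the output sequence.
    output_seq = attention_update(Q0, input_seq)                 # [m, d]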
MORE EDITS: There are quite a few other differences vs. attention that look significant. Preliminary and hand-wavy: Like other capsule-routing networks, this one gives us credit assignments that tell us how much each capsule (token embedding) in the output sequence depends on each capsule (token embedding) in the input sequence. Unlike other capsule-routing networks, this one seems to scale up really well, even though V has n x m x d elements, vs. n x d in attention.
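To make the shape difference concrete, here's a stand-in comparison (my own toy tensors; R is just a random placeholder for whatever mixing weights the routing algorithm actually computes):

    import torch

    n, m, d = 128, 16, 64
    R = torch.softmax(torch.randn(m, n), dim=-1)       # stand-in [m, n] credit assignments
    V_att  = torch.randn(n, d)                         # attention: n x d values
    V_vote = torch.randn(n, m, d)                      # routing: n x m x d (per the preprint)

    out_att  = R @ V_att                               # [m, d]: every output mixes the same V
    out_vote = torch.einsum('mn,nmd->md', R, V_vote)   # [m, d]: each output gets its own votes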
I'm still trying to make my way through the preprint :-)
EDIT: According to https://ml-jku.github.io/hopfield-layers/#update , attention is the update rule of an (iterative) "dense associative memory," even though in practice a single update works really, really well for attention if you train it with SGD.
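Here's a toy illustration of that point, just iterating the softmax update with the stored patterns serving as both keys and values (my own code, not theirs): with a reasonably large beta, a single update already lands essentially on the stored pattern, which fits with one attention update working so well in practice.

    # Iterating the Hopfield/attention update to retrieve a stored pattern.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d, n_patterns, beta = 64, 10, 8.0
    X = F.normalize(torch.randn(n_patterns, d), dim=-1)   # stored patterns (rows)
    query = X[3] + 0.1 * torch.randn(d)                   # noisy version of pattern 3

    for step in range(3):
        weights = F.softmax(beta * (X @ query), dim=-1)   # attention over stored patterns
        query = weights @ X                               # the update rule
        print(step, torch.cosine_similarity(query, X[3], dim=0).item())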