Funnily enough, I'm currently trying to make my way through a preprint showing that models of dense associative memory with bipartite structure, including Transformers (!!!), are a special case of a more general routing algorithm that implements a "block of agents" in a differentiable model of Minsky's Society of Mind: https://arxiv.org/abs/2211.11754. Maybe "symbolic" and "connectionist" AI are two sides of the same coin?
EDIT: I feel compelled to mention that the efficient implementation of that more general routing algorithm can handle input sequences with more than 1M token embeddings on a single GPU, which quite frankly seems like it should be impossible, but somehow it works: https://github.com/glassroom/heinsen_routing#routing-very-lo....
For those reading this after the fact: this thread started as a response to eigenvalue's original headline, a question, which was subsequently moved to a separate comment: https://news.ycombinator.com/item?id=34110868
Oh I see, thanks! Interesting. It sounds like this is some kind of dynamic attention, as opposed to the static attention in Transformers, where the queries and keys don't change during the calculation of their similarity. His routing algorithm computes the similarity iteratively.
OK, here's a preliminary, hand-wavy, high-level summary of my understanding so far:
It seems we could repeatedly apply attention as an iterative update rule, feeding the output sequence as the new queries: new_Q = Attention(old_Q, K, V). Obviously, no one ever does that. But in theory we could, and apparently it would work just as well.
In self-attention, we apply dense layers to the input sequence to get K and V, so we can rewrite the update rule like this: new_Q = Attention(old_Q, input_seq). Internally, Attention would have to compute and cache K and V from input_seq. The input sequence is given and remains fixed, so we can rewrite the update rule again to make that explicit, like this: new_Q = Attention(old_Q | input_seq).
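To make that concrete, here's a minimal PyTorch sketch of the update-rule view as I understand it (single attention head, no masking; names like attention_update, W_K, W_V are mine, not the preprint's or the repo's):

    # Minimal sketch of attention as an iterative update rule.
    # All names here are mine, invented for illustration.
    import torch
    import torch.nn.functional as F

    n, m, d = 128, 16, 64              # input length, output length, embedding size
    W_K = torch.nn.Linear(d, d)        # dense layers that produce K and V
    W_V = torch.nn.Linear(d, d)

    def attention_update(old_Q, input_seq):
        # One step of new_Q = Attention(old_Q | input_seq).
        # K and V are recomputed from the fixed input sequence;
        # only the queries change from one iteration to the next.
        K, V = W_K(input_seq), W_V(input_seq)            # [n, d], [n, d]
        scores = old_Q @ K.transpose(-2, -1) / d**0.5    # [m, n]
        return F.softmax(scores, dim=-1) @ V             # [m, d]

    input_seq = torch.randn(n, d)      # the input sequence stays fixed
    Q = torch.randn(m, d)              # initial queries

    for _ in range(5):                 # iterate the update rule; in practice
        Q = attention_update(Q, input_seq)  # we stop after a single update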
According to the preprint, everything the routing algorithm does can be written as an update rule U that looks just like that, except that the preprint calls the queries "x_out" and the input sequence "x_inp". Using their notation, the update rule looks like this (section 3.3 and appendix B): x_out <-- U(x_out | x_inp).
One more thing that seems important: In self-attention, we get the initial queries by applying a dense layer to the input sequence. In the routing algorithm, the initial queries come instead from the same update rule U, by assuming all possible output states have equal probability. Interestingly, the code in the GitHub repo executes only two updates by default: one to compute the initial queries, and another to get the updated queries that are returned as the output sequence. That isn't too different from what we normally do with attention: compute the initial queries, then apply one update and return the result as the output sequence.
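Reusing the names from the sketch above, the default two-update pattern would look roughly like this (again my own hand-wavy approximation, not the repo's actual code; in the real routing algorithm the output capsules are differentiated by their own parameters, which I'm omitting here):

    # "Update" 1: initial queries under the equal-probability assumption,
    # i.e. every input element weighted equally, which for plain attention
    # reduces to the unweighted mean of V.
    Q0 = W_V(input_seq).mean(dim=0, keepdim=True).expand(m, d)   # [m, d]

    # Update 2: one ordinary attention update; the result is the output sequence.
    output_seq = attention_update(Q0, input_seq)                 # [m, d]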
MORE EDITS: There are quite a few other differences vs. attention that look significant. Preliminary and hand-wavy: Like other capsule-routing networks, this one gives us credit assignments that tell us how much each capsule (token embedding) in the output sequence depends on each capsule (token embedding) in the input sequence. Unlike other capsule-routing networks, this one seems to scale up really well, even though V has n x m x d elements, vs. n x d in attention.
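To make the shape difference concrete, here's a stand-in comparison (my own toy tensors; R is just a random placeholder for whatever mixing weights the routing algorithm actually computes):

    import torch

    n, m, d = 128, 16, 64
    R = torch.softmax(torch.randn(m, n), dim=-1)       # stand-in [m, n] credit assignments
    V_att  = torch.randn(n, d)                         # attention: n x d values
    V_vote = torch.randn(n, m, d)                      # routing: n x m x d (per the preprint)

    out_att  = R @ V_att                               # [m, d]: every output mixes the same V
    out_vote = torch.einsum('mn,nmd->md', R, V_vote)   # [m, d]: each output gets its own votes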
I'm still trying to make my way through the preprint :-)
EDIT: According to https://ml-jku.github.io/hopfield-layers/#update , attention is the update rule of an (iterative) "dense associative memory," even though in practice a single update works really, really well for attention if you train it with SGD.
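Here's a toy illustration of that point, just iterating the softmax update with the stored patterns serving as both keys and values (my own code, not theirs): with a reasonably large beta, a single update already lands essentially on the stored pattern, which fits with one attention update working so well in practice.

    # Iterating the Hopfield/attention update to retrieve a stored pattern.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d, n_patterns, beta = 64, 10, 8.0
    X = F.normalize(torch.randn(n_patterns, d), dim=-1)   # stored patterns (rows)
    query = X[3] + 0.1 * torch.randn(d)                   # noisy version of pattern 3

    for step in range(3):
        weights = F.softmax(beta * (X @ query), dim=-1)   # attention over stored patterns
        query = weights @ X                               # the update rule
        print(step, torch.cosine_similarity(query, X[3], dim=0).item())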