
The writing style is amusing. :)

Some notes from a first glance:

* In the experiments, I see that he actually also runs the Single Headed Attention model with 4 heads, which somewhat contradicts the name, doesn't it? (A rough sketch of the single-head vs. multi-head distinction follows after these notes.)

* The main motivation is performance (mostly training speed), so some absolute numbers, e.g. training time, would be nice to have in the comparisons. He mentions that the Adaptive Transformer can also be trained on a single GPU within hours, and in the comparison the Adaptive Transformer gets a much better BPC on enwik8 while using slightly fewer parameters. So isn't the Adaptive Transformer better in every respect (speed and BPC)? Or how do the two compare in speed? As far as I remember, the Sparse Transformer is also more efficient (since it exploits sparsity), so a speed comparison would be interesting there as well. Or is the argument about inference speed? But then inference speed should be compared, shouldn't it?
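To make the single-head vs. multi-head point concrete, here is a rough sketch of the two in generic PyTorch-style code. This is only an illustration, not the SHA-RNN author's actual implementation; the function names, tensor shapes, and the n_heads=4 default are assumptions for the example.

    import torch
    import torch.nn.functional as F

    def single_head_attention(q, k, v):
        # q, k, v: (batch, seq_len, dim) -- one attention pattern over the full model dim
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def multi_head_attention(q, k, v, n_heads=4):
        # Split the model dim into n_heads smaller heads, attend independently, then concat.
        b, t, d = q.shape
        def split(x):
            return x.view(b, t, n_heads, d // n_heads).transpose(1, 2)  # (b, h, t, d/h)
        qh, kh, vh = split(q), split(k), split(v)
        scores = qh @ kh.transpose(-2, -1) / ((d // n_heads) ** 0.5)
        out = F.softmax(scores, dim=-1) @ vh   # (b, h, t, d/h)
        return out.transpose(1, 2).reshape(b, t, d)

    # e.g. x = torch.randn(2, 16, 64); multi_head_attention(x, x, x).shape == (2, 16, 64)

The parameter count is essentially the same either way; the heads just partition the same dimension, which is why calling a 4-head configuration "single headed" reads oddly to me.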



I don't think that was his motivation; I think it was stated quite clearly in the abstract:

> The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result.



