Embedding tables aren't hard on the GPU (they're just a lookup table), and the output softmax still requires you to do the full matrix multiply. The label may be sparse, but the computation is far from sparse.
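To make that concrete, here's a toy NumPy sketch (all sizes made up) of why a sparse label doesn't save you anything at the output layer: cross-entropy only reads one logit, but computing the logsumexp normalizer still needs every logit, i.e. the full dense matmul.

```python
import numpy as np

# Hypothetical sizes: hidden dim d, vocabulary size V.
d, V = 64, 10_000
rng = np.random.default_rng(0)
h = rng.standard_normal(d).astype(np.float32)        # hidden activation
W = rng.standard_normal((d, V)).astype(np.float32)   # output projection
label = 42                                           # sparse one-hot target

# The label touches a single logit, but the softmax normalizer
# needs all V of them, so the full (1 x d) @ (d x V) multiply runs.
logits = h @ W
loss = np.logaddexp.reduce(logits) - logits[label]   # cross-entropy
```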
> The reverse is true, embeddings are both the performance and memory-footprint bottleneck of modern NN models.
They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.
Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint: you simply store the non-zero coordinates.
The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.
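A small sketch of the point (toy sizes, assumed for illustration): the sparse encoding stores one index instead of a vocabulary-sized vector, and multiplying a one-hot vector by the embedding matrix is equivalent to a plain row lookup, which is exactly why GPUs skip the sparse matmul machinery.

```python
import numpy as np

# Hypothetical vocabulary and embedding dimensions.
vocab_size, dim = 10_000, 64
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, dim)).astype(np.float32)

token_id = 42  # sparse encoding: just the non-zero coordinate

# Dense one-hot: vocab_size floats to select a single row.
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[token_id] = 1.0
via_matmul = one_hot @ E      # full (1 x V) @ (V x d) multiply

# Same result as a direct row lookup, no matmul needed.
via_lookup = E[token_id]

assert np.allclose(via_matmul, via_lookup)
```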
The whole point of a paper like this is to skip the entire 'densification' step and deal with the sparse input directly as a sparse matrix. The LSH used in this paper improves on plain SpMSpV (sparse matrix-sparse vector multiplication), which is also inefficient on CPUs, although to a lesser extent than on GPUs.
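Just to ground the term, here's a minimal SpMSpV example using SciPy (sizes and densities are made up; this is the baseline the LSH approach improves on, not the paper's method): only the non-zero coordinates of the input participate in the product.

```python
from scipy.sparse import random as sparse_random

# Hypothetical sparse layer: 512 neurons over a 10,000-dim sparse input.
W = sparse_random(512, 10_000, density=0.01, format="csr", random_state=0)
x = sparse_random(10_000, 1, density=0.001, format="csc", random_state=1)

# SpMSpV: only the ~10 non-zero input coordinates touch columns of W,
# and the output itself stays sparse.
y = W @ x
print(y.nnz, "non-zero outputs out of", y.shape[0])
```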