I've been thinking about more or less the same idea, but the computational cost of edge inference probably makes it impractical for most of today's client devices. I see a lot of potential in this direction in the near future, though.
I think it's unclear how much compute the decompression steps actually require.
At the moment it's fairly fast, but RAM-hungry. This article does make it clear, though, that quantizing the representation works well (at least for the VAE), so it's possible quantized models could also do a decent job.
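
To make the quantization point a bit more concrete, here's a rough sketch of what storing a latent in 8 bits instead of 32 could look like. This is not the article's scheme, just plain min/max uniform quantization on a made-up random "latent"; the shape (4 x 64 x 64) and the function names are my own assumptions for illustration.

```python
import random
from array import array

def quantize_uint8(values):
    """Map floats onto 256 evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are equal
    codes = array("B", (round((v - lo) / scale) for v in values))
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float32 values from the 8-bit codes."""
    return array("f", (c * scale + lo for c in codes))

# Hypothetical stand-in for a VAE latent: 4 channels of 64x64 float32 values.
random.seed(0)
latent = array("f", (random.gauss(0.0, 1.0) for _ in range(4 * 64 * 64)))

codes, lo, scale = quantize_uint8(latent)
restored = dequantize(codes, lo, scale)

# Worst-case error of uniform quantization is about half a step.
max_err = max(abs(a - b) for a, b in zip(latent, restored))
print(len(latent) * latent.itemsize, "bytes ->", len(codes) * codes.itemsize, "bytes")
print("max abs error:", max_err, "step size:", scale)
```

The storage drops by 4x and the per-value error stays within about half a quantization step. Whether that error is tolerable for the decoder's weights, and not just for the latent, is exactly the open question.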