The basic premise of these kinds of compression algorithms is actually pretty cl...

The basic premise of these kinds of compression algorithms is actually pretty clever. Here's a very very trivialization of this style of approach:

1. both the compressor and decompressor contain knowledge beyond the algorithm used to compress/decompress some data

2. in this case the knowledge might be "all the images in the world"

3. when presented with an image, the compressor simply looks up some index or identifier of the the image

4. the identifier is passed around as the "compressed image"

5. "decompression" means looking up the identifier and retrieving the image

I've heard this called "compression via database" before and it can give the appearance of defeating Shannon theorem for compression even though it doesn't do that at all.

Of course the author's idea is significantly more sophisticated than the approach above, and trades a lossy approach for some gains in storage and retrieval efficiency (we don't have to have a copy of all of the pictures in the world in both the compressor and the decompressor). The evaluation note of not using any known image for the tests further challenges the approach and helps sus-out where there are specific challenge like poor reconstruction of specific image constructs like faces or text -- I suspect that there are many other issues like these but the author honed in on these because we (as literate humans) are particularly sensitive to them.

In these types of lossy compression approaches (as opposed to the above which is lossless) the basic approach is:

1. Throw away data until you get to the desired file size. You usually want to come up with some clever scheme to decide what data you toss out. Alternative, just hash the input data using some hash function that produces just the right number of bits you want, but use a scheme that results in a hash digest that can act as a (non-unique) index to the original image in a table of every image in the world.

2. For images it's usually easy to eliminate pixels (resolution) and color (bit-depth, channels, etc.). In this specific case, the author uses an variational autoencoder to "choose" what gets tossed. I suspect the autoencoder is very good at preserving information rich, or high-entropy, information dense slices of a latent space or something. At any rate, this produces something that to us sorta kinda looks like a very low resolution, poorly colored postage stamp of the original image, but actually contains more data than that. I think at this point it can just be considered the hash digest.

3. this hash digest, or VAE encoded image or whatever we want to call it, is what's passed around as the "compressed" data.

4. just like above, "decompression" means effectively looking up the value in a "database". If we are working with hash digests, there was probably a collision during the construction of the database of all images, so we lost some information. In this case we're dealing with stable diffusion and instead of a simple index->table entry, our "compressed" VAE image wraps through some hyperspace to find the nearest preserved data. Since the VAE "pixels" probably align close to data dense areas of the space you tend to get back data that closely represents the original image. It's still a database lookup in that sense, but it's looking more for "similar" rather than "exact matches" which when used to rebuild the image give a good approximation of the original.

Because it's an "approximation" it's "lossy". In fact I think it'd be more accurate to say it's "generally lossy" as there is a chance the original image can be reproduced exactly, especially if it's in the original training data. Which is why the author was careful not to use anything from that set.

Because we've stored so much information in the compressor and decompressor, it can also give the appearance of defeating Shannon entropy for compression except it's also not because:

a) it's generally lossy

b) just like the original example above we're cheating by simply storing lots of information elsewhere

There's probably some deep mathematical relationship between the author's approach and compressive sensing.

Still, it's useful, and has the possibility of improving data transmission speeds at the cost of storing lots of local data at both ends.

Source: Many years ago before deep learning was even a "thing", I worked briefly on some compression algorithms in an effort to reduce data transfer issues in telecom poor regions. One of our approaches was not too dissimilar to this -- throw away a bunch of the original data in a structured way and use a smart algorithm and some stored heuristics in the decompressor to guess what we threw away. Our scheme had the benefit of almost absolutely trivial "compression" with the downside of massive computational needs on the "decompression" side, but had lots of nice performance guarantees which you could use to design the data transport stuff around.

*edit* sorry if this explanation is confusing, it's been a while and it's also very late where I am. I just found this post really fun.