> If security is important, the answer to that is no
It's a little more nuanced than that. Compression may cause information leaks or it can prevent them depending on the circumstances. If you're encrypting an audio stream, then compressing it first can cause leaks. If you're encrypting a document, then compressing it first may prevent leaks.
I think compression is a red herring here. Think about keyboard-interactive ssh authentication. Whether or not the data is compressed, the timing of the packets depends on the timing between when you pressed the keys on your keyboard. Thus information is leaked, and compression was not involved.
Therefore, I feel like compression complicates the discussion unnecessarily. Information that's not encrypted (the time a key was typed or the spacing between phonemes in a word) can be recovered by an attacker without breaking the encryption. Obvious when you say it like that ;)
Yes, in other words compression can move information from the content of the data (which is encrypted) to the size of the data (which is not encrypted).
Which is one of the selling points of block ciphers. They may not conceal the magnitude of the payload, but they do obscure the exact byte count by rounding up. But that probably doesn't stop CRIME from being workable.
Hmm, I don't think it is really a selling point. A block cipher also implies needing a technique like CBC to prevent watermarking, and CBC in turn opens you up to padding oracle attacks.
Oh, I'm a fan of prepadding, and there are other surveillance problems that TLS doesn't solve that are similar to the voice analysis attack. Web pages on a site have distinct patterns of load operations, and if we haven't already established how to profile a visitor by the pattern of packets sent and received, I'm certain it's only a matter of time until we do.
Well, my first comment would be: if your encryption algorithm produces output that can be compressed, your encryption algorithm is broken.
As for compression, my understanding is that the insecurity has nothing much to do with timing, and everything to do with giving an attacker some "known" plaintext they can use Bayesian (Turing?) analysis against to shorten the key space.
No, read the dang article. You leak information about the content of the plaintext because you're changing the length. The crib doesn't matter with secure constructions.
One encrypted 512-bit block is always more secure than twelve encrypted 512-bit blocks (an average compression ratio). Larger size is always a bad thing.
The problem with compression is that the protocol headers give you a known part at the start of every message, which you can then very quickly test keys against using rainbow tables or other statistical means.
The best solution I know of for this is to compress, then XOR with a stream cipher, then run the result through a decent block cipher.
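A minimal sketch of that pipeline, purely as an illustration: zlib for compression, ChaCha20 as the stream-cipher XOR stage, and AES-CBC as the block-cipher stage, all via the third-party `cryptography` package. The key and nonce handling here is made up for the example and is not a vetted construction.

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives import padding

def compress_stream_xor_then_block_encrypt(plaintext, stream_key, block_key):
    # 1. Compress the plaintext.
    compressed = zlib.compress(plaintext)

    # 2. XOR with a stream cipher keystream (ChaCha20 here).
    nonce = os.urandom(16)
    stream = Cipher(algorithms.ChaCha20(stream_key, nonce), mode=None).encryptor()
    xored = stream.update(compressed)

    # 3. Run the result through a block cipher (AES-256-CBC with PKCS7 padding).
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(xored) + padder.finalize()
    block = Cipher(algorithms.AES(block_key), modes.CBC(iv)).encryptor()
    ciphertext = block.update(padded) + block.finalize()

    # The nonce and IV have to travel with the message so the receiver can
    # reverse the three steps in the opposite order.
    return nonce + iv + ciphertext

example = compress_stream_xor_then_block_encrypt(b"hello " * 100, os.urandom(32), os.urandom(32))
```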
The "compress after encrypt" is a trick question. If your data doesnt get bigger after that your encryption implementation is royally broken.
But you definitely don't want every "plaintext" message starting with "..gzip".
There's a pretty easy proof of why compression of truly random data is impossible. Let C be a compression algorithm. Assume that for all bit strings S, C(S) is no longer than S; that there exists at least one string S' such that C(S') is shorter than S'; and that for any two distinct bit strings S1 and S2, C(S1) != C(S2). Let S'' be the shortest such string, i.e. the shortest string that C shortens; let its length be M, and let the length of C(S'') be N < M. Since S'' is the shortest string that C shortens, no string of length N has its length changed by C. There are 2^N strings of length N, and since the images of any two distinct N-bit strings under C are distinct, every string of N bits is the image of some N-bit string. But S'' has length M yet its image has length N, so it must collide with the image of some string of length N, a contradiction. Thus, for any compression algorithm which is injective (in effect, reversible) and successfully shortens some strings, the algorithm must lengthen some others.
What this short proof shows is that even though random data may have patterns, no compression algorithm can successfully leverage this for every random string.
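The pigeonhole counting that drives this argument is easy to verify numerically: there are only 2^n - 1 bit strings shorter than n bits, so no injective code can shrink every n-bit string. A small sketch:

```python
# Pigeonhole check: for every n there are strictly fewer bit strings shorter
# than n bits (2^n - 1 of them, counting the empty string) than there are
# strings of exactly n bits (2^n), so an injective compressor cannot shrink
# them all.
for n in range(1, 9):
    exactly_n = 2 ** n
    shorter_than_n = sum(2 ** k for k in range(n))  # 2^0 + ... + 2^(n-1) = 2^n - 1
    assert shorter_than_n < exactly_n
    print(f"n={n}: {exactly_n} strings of length n, only {shorter_than_n} shorter outputs available")
```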
To be clear, I'm not claiming that encrypted data is compressible. I'm just pointing out that any detectable "patterns" that appear in random data but not encrypted data are inconsistent with the claim that random and encrypted data are indistinguishable. Also if you find such a "pattern" that's probably a good starting point for breaking whatever cipher you're using.
Ah, but they can repeat! There is a non-zero chance that any sequence of any length will occur in a random sequence.
So while a compression algorithm applied to a very large amount of random data is unlikely to reduce its size, it is quite possible to achieve some compression on smaller blocks.
Any given _example_ of a random sequence may have less than maximum entropy, and so may be compressible, but a truly random sequence _in general_ will not be compressible. So you can make a compression algorithm that will compress random data in some cases but it will never be able to compress any arbitrary random data. Does this help?
That is still your problem and you need to take responsibility for it. It is wiser to learn and know what you're talking about; don't just add ambient, useless, incorrect noise.
Let's say you are using a one-time pad, and the encrypted text is a string of a million A characters. While unlikely to the point that it is barely worth mentioning, it is possible, and it could trivially be compressed.
I think your main point is correct, just presented too strongly. In many cases compression after encryption will yield poor results. But I wouldn't go so far as to say that it can't be done.
This seems overly pedantic. In my mind "incompressible" does not mean "there are absolutely zero situations in which the output could be shorter than the input." There will always be (compression scheme, input) combinations that produce smaller outputs. The qualifying criterion is "does expected input data typically compress?"
For a reasonable definition of incompressible, properly encrypted data is it.
My intuition says that if your ciphertext after using a one-time pad is trivially compressible, it means there's a strong correlation between your plaintext and your key. It's an interesting thought.
Consider using basic modular addition of letters, with A=1, B=2 ... Z = 26. If your plaintext is HELLO, and you get AAAAA as ciphertext, then your key has to be SVOOL. If you look, that's just an inversion of each letter. H is the 8th letter of the alphabet, whereas S is the 8th from the end, etc. So, your one-time pad was incidentally a simple translation of your message.
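For concreteness, a tiny script reproducing that example (letters numbered A=1 through Z=26, with the sum wrapping back into that range):

```python
def otp_add(plaintext, key):
    # A=1 ... Z=26; ciphertext letter number = ((p + k - 1) mod 26) + 1
    out = []
    for p, k in zip(plaintext, key):
        p_num = ord(p) - ord('A') + 1
        k_num = ord(k) - ord('A') + 1
        c_num = (p_num + k_num - 1) % 26 + 1
        out.append(chr(ord('A') + c_num - 1))
    return "".join(out)

# The key SVOOL is just each plaintext letter mirrored to the other end of
# the alphabet, and the sum of a letter and its mirror always lands on A.
print(otp_add("HELLO", "SVOOL"))  # -> AAAAA
```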
I'm not sure if there's any interesting information being leaked this way. Perhaps by seeing that the encrypted message can be heavily compressed, or knowing that it has been heavily compressed, you can assume that the encryption key has some very improbable patterns, and use that to bound guesses about the encryption key.
Could you elaborate on what kind of scenario would work on an audio stream but not on a document? In order to use CRIME, the attacker would need to repeatedly inject content before the compression stage in order to test whether it matches some part of the source. Audio is kind of heavy uncompressed and isn't normally streamed as such.
A raw audio stream is a constant bit rate stream. If you compress it, it becomes (in general) a variable bit rate stream. The bit rate contains information about the raw audio which survives encryption.
If you compress a document, that simply obscures the size of the original document.
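A toy illustration of the difference, using zlib as a stand-in for a variable-bitrate codec: three equal-length "audio" frames compress to very different sizes, and since encryption preserves length (up to a constant), those size differences remain visible to an eavesdropper:

```python
import os
import zlib

frame_len = 4096
silence = bytes(frame_len)                      # highly compressible
tone = bytes(i % 64 for i in range(frame_len))  # somewhat compressible
noise = os.urandom(frame_len)                   # essentially incompressible

for name, frame in [("silence", silence), ("tone", tone), ("noise", noise)]:
    compressed = zlib.compress(frame)
    # A length-preserving cipher keeps these differences visible on the wire.
    print(f"{name}: {frame_len} raw bytes -> {len(compressed)} compressed bytes")
```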
That is not generally true. If you compress AAA and ABC, then the compressed versions also leak information about the content, exactly as in the case of an audio stream, and they reveal more than the lengths of the uncompressed documents, which don't differ.
Yes, you can concoct theoretical cases where compressing documents leaks information. But there's a huge difference: compressing a document only leaks a single data point whereas compressing a stream produces a continual information leak. Furthermore, there are common use cases (with audio being the poster child) where compression leaks in streams can be used in practical attacks. I'm not aware of even a single case of a compression leak being used in a real attack on an encrypted document.
Imagine a voice call, using a variable-bitrate codec, broken into small packets (not necessarily over the internet), of interest to a passive eavesdropper who cannot decrypt the content, but can observe metadata: the timing of packets and their sizes.
Feed that information into a prebuilt phoneme-in-context model trained on that codec and language, and said eavesdropper could probably reconstruct a pretty good estimated transcript - different phonemes compress differently.
There is no reason this would not work in bulk: I am under the impression GCHQ have done a fair amount of research in this area, for example.
So, it's not just compression per se, but compression plus real-time plus some knowledge about the underlying information.
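A heavily simplified sketch of the idea. The per-phoneme packet-size profiles below are entirely made up, and a nearest-profile matcher stands in for a real phoneme-in-context model; the point is only that packet sizes alone are enough input for such a classifier:

```python
# Hypothetical per-phoneme packet-size signatures (bytes per codec frame);
# real profiles would have to be learned from recordings made with the
# same codec and language.
PHONEME_PROFILES = {
    "s":  [52, 54, 53],
    "ah": [91, 95, 93],
    "t":  [40, 38, 41],
}

def closest_phoneme(observed_sizes):
    # Match an observed run of packet sizes to the profile with the smallest
    # total absolute difference, a stand-in for a proper statistical model.
    def distance(profile):
        return sum(abs(a - b) for a, b in zip(observed_sizes, profile))
    return min(PHONEME_PROFILES, key=lambda name: distance(PHONEME_PROFILES[name]))

# The eavesdropper never sees plaintext, only packet sizes, yet can guess.
print(closest_phoneme([90, 96, 92]))  # -> "ah"
```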
These sorts of attacks don't matter against a compressed and then encrypted file of indeterminate data, correct? Similarly, an amalgamation of many types of data, such as a disk or archive?
If you know absolutely nothing about the uncompressed original, then it seems indeed pretty hard to infer anything from the compressed version. But as soon as you know something, for example the rough size of the uncompressed original, you can for example start to infer that the content is either more like AAA or more like ABC because repetitions will compress better. And in reality it is really unlikely that an attacker knows nothing at all about the content. You know the type of data, you know what the header structure of the file format looks like. Or you could try to learn something from the statistics of the observed messages. Maybe the data contains a time stamp that correlates with the time you observe the message. The possibilities are endless.
Sure, but isn't there more implied by what you just said? It's not just that you know the size, but if you know the header format, then you also know the type (which you might). But I'm not sure how much more or less information compression provides for non-trivially sized files. Where individual packets are encrypted, you have quite a bit of information. Where you have one file of non-trivial size, much less information is exposed.
For example, you know its type, and it's 150KB. Even if you know it's a text document, or a WAV file, or bitmap, I think you get far less information from the compressed version than from the non-compressed version. The non-compressed version of the text document may give you approximate word count, and the WAV file will give you some probable lengths of the recording at common sampling rates, and the bitmap will give you some probable dimensions. Compressed versions hide this information within the natural variance of the compression.
The problem seems less to do with compression and more to do with splitting larger data (or streams of data) into smaller chunks to encrypt, which in itself loses some of the benefit of hiding information in the variance of the data presented. Compression seems to exacerbate this by amplifying variances in very small data sets in a way that yields additional information, but it seems to me that's just a consequence of encrypting very small amounts of data being not nearly as effective as encrypting large amounts.
At least, that's what I can intuit about the situation. It may be partly (or totally) wrong given some more advanced security theory (I am not a security professional).
Thanks, it's good to have specifics when thinking about security.
I would guess that phones and VoIP are a perfect target for this, as those systems are often designed to filter out noise. I guess it would also be almost impossible to use this to transcribe an encrypted movie or piece of music, as the unpredictable sounds would break the prediction model.
And bottles are used for holding liquids rather than smashing heads... until you find yourself in a bar fight.
I might be wrong, but I think the parent had something like this in mind:
Assuming you are encrypting with AES-CFB, each plaintext block P[i] produces a ciphertext block C[i] (with key K and initialization vector IV) according to the rule: C[0] = P[0] XOR E_K(IV), and C[i] = P[i] XOR E_K(C[i-1]) for i > 0.
Given that the IV is assumed to be known by the attacker, if the plaintext is a vanilla-compressed stream, it is more likely than not that most of the bits in P[0] are going to be known as well, which might allow some sort of pruned brute-force attack on key K, given a small set of (C[0] ^ P'[0]), for all P'[0] that satisfy the known bits in P[0].
This particular implementation would benefit then from adding a pseudorandom P[0] block (a "salt", if I understood correctly) that the receiver is to discard on arrival. I don't know enough cryptanalysis to tell if the above scenario is valid or not, but it sounds like a legitimate question at least.
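A sketch of that mitigation, assuming the third-party `cryptography` package: one random block is prepended before CFB encryption and discarded by the receiver, so the predictable compression header no longer sits in P[0]:

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

BLOCK = 16  # AES block size in bytes

def encrypt_with_random_first_block(key, iv, plaintext):
    # Prepend one random block so the predictable compression header is not P[0].
    prepad = os.urandom(BLOCK)
    enc = Cipher(algorithms.AES(key), modes.CFB(iv)).encryptor()
    return enc.update(prepad + plaintext) + enc.finalize()

def decrypt_and_discard_first_block(key, iv, ciphertext):
    dec = Cipher(algorithms.AES(key), modes.CFB(iv)).decryptor()
    # The receiver throws the first (random) block away.
    return (dec.update(ciphertext) + dec.finalize())[BLOCK:]

key, iv = os.urandom(32), os.urandom(16)
message = zlib.compress(b"attack at dawn")
roundtrip = decrypt_and_discard_first_block(key, iv, encrypt_with_random_first_block(key, iv, message))
assert roundtrip == message
```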
The original article links to a paper describing an attack against an audio stream by exploiting knowledge of how the stream is encoded before encryption, then reconstructing it from the length of the encrypted stream:
It's more about whether the attacker has control of any portion of the compressed contents. If they have control over the stream, then you don't want compression; if they don't, you do.
No, control is not necessary for compression to leak information. It's enough for the attacker simply to know something about the original stream content (e.g. that it is audio) and how it compresses.
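The attacker-controlled case (the CRIME-style injection scenario mentioned a couple of comments up) is easy to demonstrate with nothing but zlib; the secret string here is hypothetical, and real attacks extend a correct guess one character at a time:

```python
import zlib

SECRET = b"session_token=kj3h45kj2h45"  # hypothetical secret in the stream

def observed_length(attacker_injection):
    # The eavesdropper sees only the ciphertext length; with a length-preserving
    # cipher that equals the compressed length plus a constant.
    return len(zlib.compress(SECRET + attacker_injection))

# A guess that repeats the secret compresses noticeably better than one that
# only matches the known prefix.
print(observed_length(b"session_token=kj3h45kj2h45"))  # matches: shorter
print(observed_length(b"session_token=XqZ0pLw9Rt2f"))  # wrong suffix: longer
```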
Thank you. Your comment wasn't clear for several reasons: stream ciphers are regularly blocked in TLS, and compression may be done per block, not just against a file larger than a single block.
In GCM, the 'C' is for counter, which turns AES into a stream cipher. This part is usually known as 'AES-CTR'. The 'G' is Galois, for 'Galois field multiplication', which allows for a parallelizable way of computing a MAC. AES-GCM packages AES-CTR (a stream cipher made from a block cipher) and GMAC (the Galois MAC) together into a primitive. This type of scheme, which combines confidentiality, integrity, and authenticity, is called 'authenticated encryption with associated data' (AEAD) [1]. A stream cipher is the easiest way of ensuring that the cipherstream safely expands to cover all the plaintext.
Other famous AEAD schemes are:
- CCM (Counter with CBC-MAC), packages AES-CTR and CBC-MAC together in an authenticate-then-encrypt regime
- ChaCha20-Poly1305, which packages together the stream cipher ChaCha20 and the MAC Poly1305.
The AES-GCM mode of operation turns the AES block cipher into a stream cipher. Doing this makes it possible to add authentication on top relatively cheaply. ChaCha20-Poly1305 has a similar construction. These are the two most secure and efficient ciphers available in TLS, so they get a lot of use. If you are sending documents over HTTPS, you definitely want them encrypted with one of these two.
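For reference, both AEAD constructions mentioned above are exposed directly by the third-party `cryptography` package; a minimal usage sketch with throwaway keys and nonces:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM, ChaCha20Poly1305

plaintext = b"a document sent over HTTPS"
associated_data = b"record header"  # authenticated but not encrypted

# AES-GCM: AES-CTR keystream for confidentiality plus GMAC for integrity/authenticity.
gcm_key = AESGCM.generate_key(bit_length=256)
gcm_nonce = os.urandom(12)  # never reuse a nonce with the same key
gcm_ct = AESGCM(gcm_key).encrypt(gcm_nonce, plaintext, associated_data)
assert AESGCM(gcm_key).decrypt(gcm_nonce, gcm_ct, associated_data) == plaintext

# ChaCha20-Poly1305: the ChaCha20 stream cipher plus the Poly1305 MAC.
cc_key = ChaCha20Poly1305.generate_key()
cc_nonce = os.urandom(12)
cc_ct = ChaCha20Poly1305(cc_key).encrypt(cc_nonce, plaintext, associated_data)
assert ChaCha20Poly1305(cc_key).decrypt(cc_nonce, cc_ct, associated_data) == plaintext
```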