
> If security is important, the answer to that is no

It's a little more nuanced than that. Compression can either cause information leaks or prevent them, depending on the circumstances. If you're encrypting an audio stream, then compressing it first can cause leaks. If you're encrypting a document, then compressing it first may prevent leaks.



I think compression is a read herring here. Think about keyboard-interactive ssh authentication. Whether or not the data is compressed, the timing of the packets depends on the timing between when you pressed the keys on your keyboard. Thus information is leaked, and compression was not involved.

Therefore, I feel like compression complicates the discussion unnecessarily. Information that's not encrypted (the time a key was typed or the spacing between phonemes in a word) can be recovered by an attacker without breaking the encryption. Obvious when you say it like that ;)


Well, yeah, but the point is that compression can cause the leakage of information through timing that would not otherwise have leaked.


Yes, in other words compression can move information from the content of the data (which is encrypted) to the size of the data (which is not encrypted).
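
A quick way to see that effect (a sketch in Python with zlib; the exact byte counts are only illustrative):

    import os
    import zlib

    # Two plaintexts with the same uncompressed length (1000 bytes each).
    repetitive = b"A" * 1000      # highly redundant content
    varied = os.urandom(1000)     # essentially incompressible content

    # After compression the lengths differ dramatically, and encryption
    # does nothing to hide length, so an observer learns about the content.
    print(len(zlib.compress(repetitive)))  # around a dozen bytes
    print(len(zlib.compress(varied)))      # roughly 1000 bytes or slightly more

Same input length in, very different ciphertext lengths out.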


Which is one of the selling points of block ciphers. They may not conceal the magnitude of the payload but they do obscure the exact byte count by rounding up. But that probably doesn't stop CRIME from being workable.
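
A minimal sketch of how much that rounding actually hides (assuming PKCS#7-style padding and a 16-byte AES block; Python):

    BLOCK = 16  # AES block size in bytes

    def padded_length(n: int, block: int = BLOCK) -> int:
        # PKCS#7 always adds between 1 and `block` padding bytes, so the
        # ciphertext length is rounded up to the next multiple of the block size.
        return (n // block + 1) * block

    # Messages of 1 to 15 bytes all produce a single 16-byte block, so only
    # the low few bits of the length are obscured.
    print(padded_length(5), padded_length(15), padded_length(16))  # 16 16 32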


Hmm, I don't think it is really a selling point. A block cipher also implies needing a technique like CBC to prevent watermarking, and using CBC actually helps enable oracle attacks.

An interesting paper on using padding to obscure length: http://cihangir.forgottenlance.com/papers/lengthhiding-corre...


Oh, I'm a fan of prepadding, and there are other surveillance problems that TLS doesn't solve that are similar to the voice analysis attack. Web pages on a site have distinct patterns of load operations, and I'm certain that if we haven't already established how to profile a visitor by the pattern of packets sent and received, it's only a matter of time until we do.


Well, my first comment would be: if your encryption algorithm produces output that can be compressed, your encryption algorithm is broken.

As for compression: my understanding is that the insecurity has nothing much to do with timing, and everything to do with giving an attacker some "known" plaintext they can use Bayesian (Turing?) analysis against to shorten the key space.


No, read the dang article. You leak information about the content of the plaintext because you're changing the length. The crib doesn't matter with secure constructions.


One encrypted 512-bit block is always more secure than twelve encrypted 512-bit blocks (an average compression ratio). Larger size is always a bad thing.

The problem with compression is that the protocol headers give you a known part at the start of every message, which you can then very quickly test keys against using rainbow tables or other statistical means.

The best solution I know of for this is to compress, then XOR with a stream cipher, then run it through a decent block cipher.

The "compress after encrypt" is a trick question. If your data doesnt get bigger after that your encryption implementation is royally broken.

But you definitely don't want every "plaintext" message starting with "..gzip".


> read herring

Freudian slip? I like it anyway.


I swear HN's font size is now too small for my eyes by default. I can't make any post without introducing a major tyop these days :)


You can't compress after encryption.

Encrypted content should be indistinguishable from random data. So encrypt-then-compress shouldn't be able to yield any reasonable compression.


With truly random data, patterns should randomly appear.


There's a pretty easy proof to show why compression of truly random data is impossible. Let C be a compression algorithm. Assume that for all bit strings S, C(S) is no longer than S; that there exists at least one string S' such that C(S') is shorter than S'; and that for any two distinct bit strings S1 and S2, C(S1) != C(S2).

Let S'' be the shortest string that C shortens, let its length be M, and let the length of C(S'') be N < M (note N >= 1). Since S'' is the shortest string that C shortens, and C never lengthens anything, every string of length N keeps its length under C. There are 2^N strings of length N, and because C is injective, every N-bit string is the image of some N-bit string. But S'' has length M and its image also has length N, so it necessarily collides with the image of some N-bit string, a contradiction. Thus, any compression algorithm which is injective (in effect, reversible) and successfully shortens some strings must lengthen some others.

What this short proof shows is that even though random data may have patterns, no compression algorithm can successfully leverage this for every random string.
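
An empirical companion to that argument (Python, with zlib standing in for "a generic compressor" and os.urandom for "random data"):

    import os
    import zlib

    data = os.urandom(1_000_000)          # ~1 MB of (pseudo)random bytes
    compressed = zlib.compress(data, 9)

    # zlib's framing and literal encoding add a little overhead, so the
    # "compressed" output of random input is typically slightly larger.
    print(len(data), len(compressed))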


Encrypted data is "indistinguishable from random data" but it is not actually random data. No patterns will appear.


If patterns appear in random data, and no patterns appear in encrypted data, then it's trivially distinguishable from random data: look for patterns.


To be clear, I'm not claiming that encrypted data is compressible. I'm just pointing out that any detectable "patterns" that appear in random data but not encrypted data are inconsistent with the claim that random and encrypted data are indistinguishable. Also if you find such a "pattern" that's probably a good starting point for breaking whatever cipher you're using.



I'm sorry, that article is far too long to tell me what you are trying to say. All I can gather is that I was wrong.


tl;dr Random data may appear to have patterns, but those patterns do not repeat, so they cannot be compressed.


Ah, but they can repeat! There is a nonzero chance that any sequence of any length will occur in a random sequence.

So while a compression alg applied to a very large amount of random data is unlikely to reduce the size, it is quite possible to achieve some compression on smaller blocks.


Any given _example_ of a random sequence may have less than maximum entropy, and so may be compressible, but a truly random sequence _in general_ will not be compressible. So you can make a compression algorithm that will compress random data in some cases but it will never be able to compress any arbitrary random data. Does this help?


You should take a look at the almost 15 year old (and unsolved) "Random Compression Challenge"

http://marknelson.us/2012/10/09/the-random-compression-chall...


That is still your problem and you need to take responsibility for it. It is wiser to learn and know what you're talking about rather than just adding ambient, useless, incorrect noise.


Saying something true but burying it is not wise either.


Let's say you are using a one-time pad, and the encrypted text is a string of a million A characters. While unlikely to the point that it is barely worth mentioning, it is possible. And could trivially be compressed.

I think your main point is correct, just presented too strongly. In many cases compression after encryption will yield poor results. But I wouldn't go so far as to say that it can't be done.


This seems overly pedantic. In my mind "incompressible" does not mean "there are absolutely zero situations in which the output could be shorter than the input." There will always be (compression scheme, input) combinations that produce smaller outputs. The qualifying criterion is "does expected input data typically compress?"

For a reasonable definition of incompressible, properly encrypted data is it.


My intuition says that if your ciphertext after using a one-time pad is trivially compressible, it means there's a strong correlation between your plaintext and your key. It's an interesting thought.

Consider using basic modular addition of letters, with A=1, B=2 ... Z = 26. If your plaintext is HELLO, and you get AAAAA as ciphertext, then your key has to be SVOOL. If you look, that's just an inversion of each letter. H is the 8th letter of the alphabet, whereas S is the 8th from the end, etc. So, your one-time pad was incidentally a simple translation of your message.
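
For concreteness, here's that letter arithmetic as a tiny Python sketch (A=1 ... Z=26 and addition mod 26, exactly as above):

    def nums(s):
        return [ord(c) - ord('A') + 1 for c in s]  # A=1 ... Z=26

    def add_mod26(plain, key):
        out = []
        for p, k in zip(nums(plain), nums(key)):
            v = (p + k) % 26 or 26        # keep the 1..26 convention
            out.append(chr(v - 1 + ord('A')))
        return "".join(out)

    print(add_mod26("HELLO", "SVOOL"))  # -> AAAAA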

I'm not sure if there's any interesting information being leaked this way. Perhaps by seeing that the encrypted message can be heavily compressed, or knowing that it has been heavily compressed, you can assume that the encryption key has some very improbable patterns, and use that to bound guesses about the encryption key.

Not an expert. Not even an intermediate.


Could you elaborate on what kind of scenario would work on an audio stream but not on a document? In order to use CRIME, the attacker would need to repeatedly inject content before the compression stage in order to test if it matches some part of the source. Audio is kind of heavy uncompressed and isn't normally streamed as such.


A raw audio stream is a constant bit rate stream. If you compress it, it becomes (in general) a variable bit rate stream. The bit rate contains information about the raw audio which survives encryption.
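
A crude illustration (Python, with zlib standing in for a real audio codec; the chunking is just for demonstration):

    import os
    import zlib

    # Fake "audio": alternating quiet (near-constant) and busy (noisy) chunks.
    chunks = [b"\x00" * 400 if i % 2 == 0 else os.urandom(400) for i in range(8)]

    # Per-chunk compressed sizes form a fingerprint of the content over time.
    # Encrypting each compressed chunk does not hide these sizes.
    print([len(zlib.compress(c)) for c in chunks])  # small, large, small, large, ...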

If you compress a document, that simply obscures the size of the original document.


> If you compress a document, that simply obscures the size of the original document.

That is not generally true - if you compress AAA and ABC then the compressed version also leaks information about the content, exactly like in the case of an audio stream, and it reveals more information than the lengths of the uncompressed documents, which don't differ.


Yes, you can concoct theoretical cases where compressing documents leaks information. But there's a huge difference: compressing a document only leaks a single data point whereas compressing a stream produces a continual information leak. Furthermore, there are common use cases (with audio being the poster child) where compression leaks in streams can be used in practical attacks. I'm not aware of even a single case of a compression leak being used in a real attack on an encrypted document.


Imagine a voice call, using a variable-bitrate codec, broken into small packets (not necessarily over the internet), of interest to a passive eavesdropper who cannot decrypt the content, but can observe metadata: the timing of packets and their sizes.

Feed that information into a prebuilt phoneme-in-context model trained on that codec and language, and said eavesdropper could probably reconstruct a pretty good estimated transcript - different phonemes compress differently.

There is no reason this would not work in bulk: I am under the impression GCHQ have done a fair amount of research in this area, for example.


So, it's not just compression per se, but compression plus real-time plus some knowledge about the underlying information.

These sorts of attacks don't matter against a compressed and then encrypted file of indeterminate data, correct? Similarly, an amalgamation of many types of data, such as a disk or archive?


If you know absolutely nothing about the uncompressed original, then it seems indeed pretty hard to infer anything from the compressed version. But as soon as you know something, for example the rough size of the uncompressed original, you can for example start to infer that the content is either more like AAA or more like ABC because repetitions will compress better. And in reality it is really unlikely that an attacker knows nothing at all about the content. You know the type of data, you know what the header structure of the file format looks like. Or you could try to learn something from the statistics of the observed messages. Maybe the data contains a time stamp that correlates with the time you observe the message. The possibilities are endless.


Sure, but isn't there more implied by what you just said? It's not just that you know the size, but if you know the header format, then you also know the type (which you might). But I'm not sure how much more or less information compression provides for non-trivially sized files. Where individual packets are encrypted, you have quite a bit of information. Where you have one file of non-trivial size, there is much less information exposed.

For example, you know its type, and it's 150KB. Even if you know it's a text document, or a WAV file, or bitmap, I think you get far less information from the compressed version than from the non-compressed version. The non-compressed version of the text document may give you approximate word count, and the WAV file will give you some probable lengths of the recording at common sampling rates, and the bitmap will give you some probable dimensions. Compressed versions hide this information within the natural variance of the compression.

The problem seems less to do with compression, and more to do with splitting larger data (or streams of data) into smaller chunks to encrypt, which in itself loses some of the benefits of hiding information in the variance of the data presented. Compression seems to exacerbate this situation by amplifying variances in very small data sets in a way that yields additional information, but it seems to me that's just a natural progression of encrypting very small amounts of data being not nearly as effective as large amounts of data.

At least, that's what I can intuit about the situation. It may be partly (or totally) wrong given some more advanced security theory (I am not a security professional).


Thank you, that was the tldr I was looking for in this thread.


Thanks, it's good to have specifics when thinking about security.

I would guess that phones and VoIP are a perfect target for this, as those systems are often designed to filter out noise. I guess it would also be almost impossible to use this to transcribe an encrypted movie or music, as the unpredictable sounds would break the prediction model.


Could this threat be neutralized by salting before compressing?


No. Salting is to stop pre-image attacks in encryption. It doesn't affect compression.


And bottles are used for holding liquids rather than smashing heads... until you find yourself in a bar fight.

I might be wrong, but I think the parent had something like this in mind:

Assuming you are encrypting with AES-CFB, each of your plaintext blocks P[i] produces a ciphertext block C[i] (with key K and initialization vector IV) according to the rule:

    C[0] = AES_encrypt(K, IV) ^ P[0], 
    C[i] = AES_encrypt(K, C[i-1]) ^ P[i].
Given that the IV is assumed to be known by the attacker, if the plaintext is a vanilla-compressed stream, it is more likely than not that most of the bits in P[0] are going to be known as well, which might allow some sort of pruned brute-force attack on key K, given a small set of (C[0] ^ P'[0]), for all P'[0] that satisfy the known bits in P[0].

This particular implementation would benefit then from adding a pseudorandom P[0] block (a "salt", if I understood correctly) that the receiver is to discard on arrival. I don't know enough cryptanalysis to tell if the above scenario is valid or not, but it sounds like a legitimate question at least.
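
If I sketch it in code (Python with the `cryptography` package; the random prefix block and CFB mode are just my reading of the idea above, not a recommendation):

    import os
    import zlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def encrypt_with_random_prefix(key: bytes, plaintext: bytes) -> bytes:
        iv = os.urandom(16)
        # Prepend one random block so the first ciphertext block no longer
        # combines the keystream with a predictable compression header.
        body = os.urandom(16) + zlib.compress(plaintext)
        enc = Cipher(algorithms.AES(key), modes.CFB(iv)).encryptor()
        return iv + enc.update(body) + enc.finalize()

    ct = encrypt_with_random_prefix(os.urandom(32), b"hello " * 100)

The receiver would decrypt, discard the first 16 bytes, and then decompress.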


The original article links to a paper describing an attack against an audio stream by exploiting knowledge of how the stream is encoded before encryption, then reconstructing it from the length of the encrypted stream:

https://news.ycombinator.com/item?id=11995311


> If you're encrypting a document, then compressing it first may prevent leaks.

I think this is only true if you use a bad encryption algorithm where identical plaintext blocks produce identical ciphertext blocks.


Oh, I'm not talking about leaking plaintext, I'm talking about leaking the length of the document.


It's more about whether the attacker has control of any portion of the compressed contents. If they have control over the stream, then you don't want compression; if they don't, you do.


No, control is not necessary for compression to leak information. It's enough for the attacker simply to know something about the original stream content (e.g. that it is audio) and how it compresses.


Interesting. KeePass [0] (a password store) compresses with GZip by default before encryption.

[0] http://keepass.info/help/v2/dbsettings.html#compression


Are you talking about compressing plain text larger than a block, then encrypting per each block of the compressed output?


Yes, of course. What would be the alternative?


Thank you. Your comment wasn't clear for several reasons. Stream ciphers are regularly blocked in TLS, and compression can occur per block. Compression may be done per block, not just against a file larger than a single block.


@lisper

Stream ciphers.


Why would you use a stream cipher to encrypt a document?


In GCM, the 'C' is for counter, which turns AES into a stream cipher; this part is usually known as 'AES-CTR'. The 'G' is for 'Galois field multiplication', which allows a parallelizable way of computing a MAC. AES-GCM packages AES-CTR (a stream cipher made from a block cipher) and GMAC (the Galois MAC) together into a single primitive. This type of scheme, which combines confidentiality, integrity, and authenticity, is called 'authenticated encryption with associated data' (AEAD) [1]. A stream cipher is the easiest way to ensure the keystream safely expands to cover all of the plaintext.

Other famous AEAD schemes are:

- CCM (Counter with CBC-MAC), packages AES-CTR and CBC-MAC together in an authenticate-then-encrypt regime

- ChaCha20-Poly1305, which packages together the stream cipher ChaCha20 and the MAC Poly1305.

[1] https://en.wikipedia.org/wiki/Authenticated_encryption
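
For what it's worth, here's roughly what using one of these looks like with Python's `cryptography` package (the AESGCM helper is from that library; the nonce handling is only illustrative):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)
    aead = AESGCM(key)

    nonce = os.urandom(12)     # 96-bit nonce; never reuse one with the same key
    document = b"contents of the document"
    ciphertext = aead.encrypt(nonce, document, b"associated data")

    # Decryption verifies the GMAC tag; tampering raises InvalidTag.
    assert aead.decrypt(nonce, ciphertext, b"associated data") == document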


The AES-GCM mode of operation turns the AES block cipher into a stream cipher. Doing this makes it possible to add authentication on top relatively cheaply. ChaCha20-Poly1305 has a similar construction. These are the two most secure and efficient ciphers available in TLS, so they get a lot of use. If you are sending documents over HTTPS, you definitely want them encrypted with one of these two.


Fair enough, I concede the point.


Why wouldn't you?


http://security.stackexchange.com/questions/334/advantages-a...

Not that this really matters. The advantages/disadvantages of compression are the same in both cases.



