This competition is flawed in the context of artificial intelligence. Most of the information in the Wikipedia text is in the language, not in the 'actual' knowledge. The competition would be much more interesting if it allowed the uncompressed text to say the same things in different words, although there's then the problem of verifying whether it really does say the same things.
Compression is closely related to intelligence. In order to compress something you have to understand the pattern behind it. The better you can learn the pattern, the better you can compress.
Lossy compression, yes. Lossless compression, not so much. And, the parent post seems to have been hinting at this. I.e. being able to usefully "tl;dr" all of Wikipedia would require a rather intelligent system. But, recapitulating all of Wikipedia's content word-for-word and pixel-for-pixel (including page layout for the truly pedantic) would not meaningfully test the sort of abstract pattern recognition that you seem to be confusing with lossless compression.
In other words, though often conflated, information and meaning are not equivalent. "Meaning", as people typically use the word, seems to include some concept of the utility of a bit of information. Most information in typical text/images/sounds/etc. is really quite useless. Being able to discard the useless while keeping the useful actually seems like one of the key traits exhibited by intelligent systems, and this ability is not measured (and perhaps even selected against) by lossless compression performance.
Lossless compression may not be as useful as a summarizing algorithm, but it is definitely a test of intelligence. Lossless compression requires identifying the pattern that produced an input as perfectly as possible. If you have an algorithm that can predict the next letter with high probability, you only need a few bits to store a string of text. And making good predictions on real-world data is a good test of AI and machine learning.
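To make that concrete, here's a toy back-of-the-envelope sketch in Python (the probabilities are made up, purely illustrative): the ideal code length for a letter predicted with probability p is -log2(p) bits, so a good predictor gets you far below the raw 8 bits per character.

    import math

    # Hypothetical per-letter predictions from a good model: the letter that
    # actually occurs next is assigned high probability each time.
    predicted_probs = [0.9, 0.8, 0.95, 0.7, 0.9]

    # Ideal code length for a symbol with predicted probability p is -log2(p) bits.
    total_bits = sum(-math.log2(p) for p in predicted_probs)
    print(f"{total_bits:.2f} bits for {len(predicted_probs)} letters")
    # ~1.2 bits in total, versus 40 bits for five raw 8-bit characters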
"Lossless compression requires identifying the pattern that produced an input as perfectly as possible": no, it requires identifying it absolutely perfectly. This includes all vacuous information as well. E.g., in the Wikipedia example, if there were a 2d scatter plot of a sample from a bivariate uniform distribution, lossless compression would require "memorizing" all of the plotted points.
Predicting perfectly is much, much different from predicting well. Machine learning is about the latter, while lossless compression is about the former.
This is not true. If you have a good predictor, you only need a few bits to store a piece of information. One way is just to record the places where your prediction is wrong. The ideal way is to split the set of possible sequences so that exactly half the probability mass lies on each side; every bit you emit then tells you which half to go down.
So instead of using 64 bits to specify the x/y coordinates of every point on the plot, you could use a much smaller number of bits to represent how far each point diverges from its predicted location. Each bit narrows the set of possible locations in half, and you only need enough bits to cover those possibilities, not all of them.
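A minimal sketch of that halving idea (toy code, not any contestant's actual coder): repeatedly split the remaining candidates so roughly half the predicted probability mass lies on each side, and emit one bit per split. Likely symbols come out in very few bits, unlikely ones in more.

    def encode_by_halving(probs, target):
        """Encode index `target` by repeatedly splitting the remaining candidates
        so roughly half the probability mass lies on each side; each emitted bit
        says which half the target falls in."""
        lo, hi = 0, len(probs)          # candidate indices are [lo, hi)
        bits = []
        while hi - lo > 1:
            total = sum(probs[lo:hi])
            running = 0.0
            split = hi - 1              # fall-back split point
            for i in range(lo, hi - 1):
                running += probs[i]
                if running >= total / 2:
                    split = i + 1
                    break
            if target < split:
                bits.append(0)
                hi = split
            else:
                bits.append(1)
                lo = split
        return bits

    # A confident predictor: the expected next symbol gets 90% probability.
    probs = [0.9, 0.04, 0.03, 0.02, 0.01]
    print(encode_by_halving(probs, 0))   # [0] -- one bit when the prediction is right
    print(encode_by_halving(probs, 4))   # [1, 1, 1] -- more bits when it is wrong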
> Most of the information in the Wikipedia text is in the language, not in the 'actual' knowledge.
Exactly, I see the same problem. With the current rules, this is just a challenge of text compression.
I believe compressing the knowledge in Wikipedia would be more akin to extracting meaning with NLP into a highly normalized dataset, and then compressing that. From an AI perspective, this is closer to how our memory actually works. Unless you're a savant, you don't memorize the actual text of a topic, but the inferred associations, which compress much better, albeit lossily.
You give lossless compressors too little credit. An optimal text compressor (with a sane running time) would have to "understand" the context well enough to generate a density function over words close to the true one, and encode against it optimally. An optimal lossy compressor would generate that same density, but would also use an error metric to reach a "rate-distortion" boundary and encode optimally for the desired position on that boundary.
So what separates the two is the error metric, in other words, how much humans would mind if you replaced x with y. But that involves humans, not general intelligence -- you'd need a lot of testing with actual humans to get good results (JPEG, for example, relied on extensive testing to find optimal quantizers for chroma values based on human perception).
Not that the latter would not be interesting too; I just think it's less practical.
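For what it's worth, a toy numerical illustration of that rate-distortion trade-off (the sample values and step sizes are arbitrary): coarser quantization costs fewer bits per sample but introduces more error, and where on that curve you want to sit is exactly what the error metric decides.

    import math

    # Arbitrary sample values in [0, 1]; quantize with different step sizes.
    samples = [0.13, 0.58, 0.91, 0.27, 0.44, 0.76, 0.05, 0.69]

    for step in (0.01, 0.1, 0.25):
        quantized = [round(x / step) * step for x in samples]
        mse = sum((x - q) ** 2 for x, q in zip(samples, quantized)) / len(samples)
        rate = math.log2(1 / step)   # bits/sample if all bins in [0, 1] were equally likely
        print(f"step={step}: ~{rate:.1f} bits/sample, MSE={mse:.5f}")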
I had something else in mind; I think the parent did too.
Imagine, for instance, how many articles state that A has relation R with C, and that B does too (e.g., George Washington was a president of the USA, and so was Abraham Lincoln). For some definition of "understanding", you wouldn't need to encode the text of both articles A and B; you could use NLP to somehow store just the ontology. The problem would change from encoding text to encoding a graph. It would be a lossy process (you wouldn't be able to decompress back into the original Wikipedia), but in my opinion it seems more relevant to AI than bothering with the original corpus.
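A very rough sketch of what that might look like (the entity and relation names are just illustrative, and the "NLP pipeline" is assumed): store shared facts as triples and group them, rather than keeping two full passages of text.

    from collections import defaultdict

    # Illustrative facts extracted by some hypothetical NLP pipeline.
    facts = [
        ("George Washington", "president_of", "USA"),
        ("Abraham Lincoln",   "president_of", "USA"),
    ]

    # Group by (relation, object) so the repeated structure is stored only once.
    grouped = defaultdict(list)
    for subject, relation, obj in facts:
        grouped[(relation, obj)].append(subject)

    print(dict(grouped))
    # {('president_of', 'USA'): ['George Washington', 'Abraham Lincoln']}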
Would it be considered cheating if you found a pseudo random generator that took a small seed and expanded to the executable of the current first place winner?
It is a common misconception among people designing compression algorithms that such a technique is possible at all in general, or feasible for specific cases (it is neither).
As an illustration, consider a shuffled deck of playing cards, and you want to find a seed for your random number generator so you can just store the seed instead of the entire deck. If your seed is a 64-bit number, then the chance that you can represent the deck using a seed is about 1 in 4x10^48, or basically zero. Or phrased the opposite way, if you use a random number generator with a 64-bit state to shuffle a deck, then the vast majority of possible decks will never be output by your shuffling program.
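The arithmetic, if you want to check it yourself (Python, just the counting argument above):

    import math

    decks = math.factorial(52)   # ~8.07e67 possible orderings of a 52-card deck
    seeds = 2 ** 64              # ~1.84e19 decks reachable from a 64-bit seed
    print(f"fraction of decks reachable: about 1 in {decks / seeds:.2e}")
    # about 1 in 4.37e+48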
I mean, if there were a God and he wanted to prove that he really is all-powerful, he would have to be able to pull off that kind of trick to convince me.
That depends on the advertising you decide to accept as valid; I'm sure none of my Christian friends would agree with you, assuming I could explain the scenario to them.
If you take it for granted that the pretext of the question is an analogy for the creation of the universe, and that the consequence of the ability is knowledge of all that occurs, then that isn't how omniscience was explained to me: a key aspect of Catholicism is free will, and such an ability would allow God to sort out the sinners without running the "experiment", viz. the universe.
I have to mention that this is only according to my meagre and disinterested understanding.
It's true that a given small seed can generate a long sequence. Sequences like these have low Kolmogorov complexity (http://en.wikipedia.org/wiki/Kolmogorov_complexity), since it's possible to write a short program that generates the comparatively long sequence.
A pseudo-random number generator with a seed length of 64 bits is capable of generating about 2^64 possible sequences (roughly one for each possible seed).
Given a random long sequence, though, it's quite unlikely that there's a short seed that will generate it; given a seed size of 64 bits and a random sequence of 100 bits, for example, the chance that there's a seed that will generate the sequence is roughly (2^64)/(2^100) = 2^-36, or about 1 in 70 billion.
A theoretically optimal solution to Hutter's challenge would somehow notice whenever this happens to be the case, across all possible seeds and all possible sequence-generating functions, and provide the seed and function as the solution!
So, no: it wouldn't be cheating. But it's overwhelmingly unlikely to be possible.
No, this would be precisely the sort of optimization that they are seeking.
But... it's hard. Doing this over large numbers of high-entropy bits, with a (varying) small seed and some (varying) simple algorithm, quickly requires a search space whose scale implies more computation than could plausibly exist in our observable universe. There has to be a minimum-message-length compression solution for any arbitrary message and decoder environment (whether their code or the uncompressed text), but good luck provably finding it for messages even as long as this post.
"You would have to be a god" is a reasonable response if it's talking about using computation capability outside of our observable universe and the physical rules we understand it by.
Some other problems with the prize task as defined:
* Wikipedia is light on images (due to copyright issues, I suppose), but the wider challenge of compressing 'human knowledge' really has to deal with the vast masses of paper books, which are fundamentally images (before they are OCR'd and compressed). Also, since this is really an exercise in preserving the historical record, you want to preserve the exact original visual appearance, blemishes and all, which means the true image must be retained along with the OCR text. Additionally, there are other recording media (photos, film, etc.) to capture.
* When you include analog original formats (of which ink on paper is one), the whole idea of 'lossless' compression is moot. What you're really after is compression with no _perceptible_ content loss.
* Which means that the issue of 'acceptable quality' is crucial. And deciding what is acceptable quality loss for different forms of source material is something that will require very good AI.
For instance, images printed with a mix of offset screening (those tiny dot patterns) and solid ink edges (text, hard lines, etc.) are very difficult to compress. You can't just blur everything to reduce the screening dots to an even gradient, because that ruins the edge definition of the text and lines. You can't scan at a resolution near or below the dot pitch, since that produces horrible moire patterning. And you can't scan and save the image at a resolution high enough to capture the dots exactly, since that makes the file size HUGE.
So your AI has to actually 'understand' the image to some extent, and smooth out just the screened areas, with precise but hard to identify mask edges. If you've ever done this by hand in photoshop, you'll know how hard it is.
That is actually a problem I'm stuck on with some historic technical manuals I'm trying to make digital copies of atm. If anyone knows of an automated way to blur offset screening (but not other printed elements on the page) to evenly shaded tones, please say so. See: http://everist.org/archives/scans/query/image_processing_que...
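Not a real answer, but a rough sketch of the kind of selective smoothing described above, assuming scipy and Pillow; the filename, sigmas, and threshold are placeholders that would need hand-tuning per scan, and it will still misfire on fine line work.

    import numpy as np
    from PIL import Image
    from scipy.ndimage import gaussian_filter, binary_opening, binary_dilation

    # Load the scan as grayscale in [0, 1]; "scan.png" is a placeholder name.
    img = np.asarray(Image.open("scan.png").convert("L"), dtype=float) / 255.0

    # Band-pass response: halftone dots show up as strong mid-frequency texture.
    bandpass = gaussian_filter(img, sigma=1) - gaussian_filter(img, sigma=3)

    # Local energy of that texture, smoothed so it covers whole screened regions.
    energy = gaussian_filter(bandpass ** 2, sigma=8)

    # Threshold (needs tuning per scan), then clean the mask: drop isolated
    # responses from text edges, and grow it to cover screened areas with margin.
    mask = energy > 0.002
    mask = binary_opening(mask, iterations=2)
    mask = binary_dilation(mask, iterations=4)

    # Blur only the masked (screened) areas; leave text and line art untouched.
    smoothed = gaussian_filter(img, sigma=2)
    result = np.where(mask, smoothed, img)

    Image.fromarray((result * 255).astype(np.uint8)).save("descreened.png")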
Another question: are there any freeware compression utils that can do the RARbook trick? I.e., pass them a file.jpg and they'll ignore the type extension, scan through the file for a valid archive header, and unpack from there on, as WinRAR does. So frustrating that WinZip, ALZip, etc. don't do this.
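I don't know of one offhand, but the scan itself is simple enough to script; here's a rough standalone sketch (Python standard library only; the signatures are the standard RAR and ZIP magic bytes) that copies everything from the first archive signature onward into a new file that a normal archiver can open.

    import sys

    # "Rar!\x1a\x07" covers RAR 1.5-5.x archives; "PK\x03\x04" is a ZIP local file header.
    SIGNATURES = [b"Rar!\x1a\x07", b"PK\x03\x04"]

    def carve_archive(path, out_path):
        data = open(path, "rb").read()
        offsets = [o for o in (data.find(sig) for sig in SIGNATURES) if o != -1]
        if not offsets:
            return False
        with open(out_path, "wb") as f:
            f.write(data[min(offsets):])   # copy from the first signature onward
        return True

    if __name__ == "__main__":
        # e.g.: python carve.py file.jpg carved.rar
        print(carve_archive(sys.argv[1], sys.argv[2]))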
http://mattmahoney.net/dc/text.html
https://news.ycombinator.com/item?id=6001489
It's linked in the OP but easy to overlook.
Here's how the prize winner compares with generic tools on en-wiki text (from Mahoney's page):