Missed opportunity to explain how compression and prediction are related, and that the better you can predict the next token, the better your compression gets. Then your article gets to mention GPT, hey.
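To make the link concrete (a toy sketch, not how GPT is actually wired to a compressor; both predictors below are made up): an arithmetic coder can encode a token the model assigns probability p in about -log2(p) bits, so the total coding cost is exactly the predictor's cross-entropy on the data, and a sharper predictor means a smaller file.

    import math

    def coding_cost_bits(tokens, predict):
        # predict(prefix) returns a dict of next-token probabilities.
        # An arithmetic coder spends about -log2(p) bits on a token
        # the model assigned probability p.
        total = 0.0
        for i, tok in enumerate(tokens):
            p = predict(tokens[:i])[tok]
            total += -math.log2(p)
        return total

    # A predictor that knows nothing: 2 bits per symbol, always.
    def uniform(prefix):
        return {s: 0.25 for s in "abcd"}

    # A predictor that has "learned" the data is mostly 'a'.
    def skewed(prefix):
        return {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}

    text = "aaabacada"
    print(coding_cost_bits(text, uniform))  # 18.0 bits
    print(coding_cost_bits(text, skewed))   # ~13.1 bits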
AIXI / Solomonoff prediction uses lossless compression, which may lead to massive overfitting. If anything, some degree of lossy compression would be "equivalent" to intelligence. Ockham's razor also says that the simplicity of a hypothesis can outweigh a rival hypothesis's better fit to the available evidence. It's a trade-off, and AIXI doesn't make that trade-off; it insists on perfect compression/prediction of the available data.
It's basically the curve-fitting problem: you don't want the simplest curve that fits all the available data points perfectly; you want an even simpler curve that still fits the evidence reasonably well. If you hit the right balance between simplicity and fit, you can expect your model to generalize to unseen data, to make successful predictions. That would be intelligence, or some major part of it.
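Here's that trade-off in miniature, assuming nothing beyond numpy: ten noisy samples of a quadratic, fit by a simple degree-2 polynomial and by a degree-9 polynomial that threads every training point.

    import numpy as np

    rng = np.random.default_rng(0)
    # Ten noisy samples of y = x^2: the "available evidence".
    x_train = np.linspace(-1, 1, 10)
    y_train = x_train**2 + rng.normal(0, 0.1, x_train.shape)
    # Unseen data the model should generalize to.
    x_test = np.linspace(-1, 1, 100)
    y_test = x_test**2

    for degree in (2, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, train_err, test_err)

    # Degree 9 hits all ten training points (near-zero training error)
    # but typically does worse on the unseen points than the simpler
    # degree-2 fit, which tolerates some misfit to the noise.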
I think in a sense compression is the harder problem: not only do you want to correctly predict the next token, you also want to do it fast, with a minimal but efficient algorithm that doesn't require much space or a big dictionary.
You could think of it as taking a "snapshot" of an AI and then optimizing the hell out of it for a specific case; you end up with a good compression algorithm.
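The speed/size tension is easy to watch in an ordinary general-purpose compressor: zlib's compression level buys a smaller output with more CPU time (toy corpus below; exact numbers vary by machine).

    import random, time, zlib

    # Toy corpus with some statistical structure to exploit.
    random.seed(0)
    words = ["compression", "prediction", "token", "model", "data",
             "intelligence", "the", "a", "of", "and"]
    data = " ".join(random.choice(words) for _ in range(200000)).encode()

    for level in (1, 6, 9):
        start = time.perf_counter()
        out = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        print(level, len(out), round(elapsed, 3))

    # Higher levels search harder for repeated strings: smaller
    # output, more CPU time. The decompressor ships no handcrafted
    # dictionary; the "knowledge" lives in the match-finding effort.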
The modality of the data carries an amount of information comparable to the data itself. Telling ChatGPT that it's hearing music rather than a story would probably help its reasoning a lot.
On a lower level, you can't tell ChatGPT to reason about an image when its only input is a microphone.
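A bottom-of-the-stack illustration of how much the modality label matters (a contrived example, nothing to do with ChatGPT's actual inputs): the very same bytes decode to completely different things depending on which interpretation you're told to apply.

    import struct

    # The same eight bytes, with no modality label attached.
    payload = bytes([72, 105, 33, 33, 64, 73, 15, 219])

    # Told it's ASCII text, the first half is a greeting:
    print(payload[:4].decode("ascii"))    # Hi!!
    # Told it's two big-endian floats, the second half is pi:
    print(struct.unpack(">2f", payload))  # (238724.515625, 3.1415927...)

    # Without the modality label, neither reading is recoverable
    # from the bytes alone.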