As someone who is invested in researching said math, I can say with some confidence that it does not exist, or at least not in the form claimed here. That's the whole problem.
I would be ecstatic if it did though, so if anyone has any examples or rebuttal, I would very much appreciate it.
You're right that there is no purely mathematical argument; it's almost nonsensical to claim one.
Instead, you can simply make the rather reasonable observation that LLMs are a product of their training distribution, which covers only part of the possible observable states of the world. Some highly regular observable states are thus likely missing, but an embodied agent (like a human) can reason correctly about them because other sensory modes observe those states. It's very clear that the number of possible states far exceeds any text describing those states (the fraction of states transcribed to text is vanishingly small, yet those states are locally coherent and quite regular when observed through other sensory modes like vision). That's the closest you can get to a mathematical argument.
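If you want to put toy numbers on that, here's a crude back-of-the-envelope sketch. Every figure in it is an assumption chosen only for order of magnitude (corpus size, tokens per scene, the discretization of a "scene"), not a measurement:

```python
# Crude order-of-magnitude sketch (all numbers below are assumptions, not
# measurements): compare how many physical configurations a text corpus could
# even mention once against how many coarsely distinct configurations exist.
corpus_tokens = 1e13        # assumed size of a large training corpus
tokens_per_scene = 100      # assume ~100 tokens to describe one configuration
describable = corpus_tokens / tokens_per_scene      # at most ~1e11 scenes

objects = 20                # a modest scene: 20 movable objects
positions = 1000            # each object in one of 1000 coarse positions
possible = positions ** objects                      # 1000**20 = 1e60 scenes

print(f"scenes the corpus could describe: ~{describable:.0e}")
print(f"coarsely distinct possible scenes: ~{possible:.0e}")
print(f"fraction with any text coverage:   ~{describable / possible:.0e}")
```

The exact numbers are beside the point; the gap doesn't close by making the corpus a few orders of magnitude bigger.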
As a result, one common failure case for an LLM is to describe a specific physical environment and an associated set of actions that sit in a region of low (or no) probability mass in the training data, and then ask it to reason about them. Here's an example where I asked ChatGPT the following:
imagine i walked into a room through a doorway, then turned around and walked right back out of the room. from the perspective of a person in the room facing the door, which side would my right arm be? please explain.
It responded by incorrectly stating the following:
Let’s break this down step by step.
1. *You walk into the room through the doorway*:
* As you're entering, the person in the room is facing the door. So, you're entering from the **outside**.
* Your **right arm** will be on the **right side** of your body, just as it is normally when you're standing.
2. *You turn around and walk right back out*:
* When you turn around to walk out, you’ll be facing **away** from the room, looking back toward the doorway.
* Since you're walking **back out**, the person inside the room is still facing the door, while you are walking away from them, so your **right arm** is still on the right side of your body.
So, from the perspective of the person in the room, as you’re exiting, your *right arm* will still be on their *left side*. That’s because they’re facing the door, and your right side is to their left when you’re turning around and leaving.
To visualize:
* Imagine a door in front of them.
* You walk through and turn around, so you're now facing away from them.
* Your right arm is on your right side, but to them, it's on their left side since you’re now moving in the opposite direction.
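For what it's worth, the correct answer has a short geometric check. Here's a minimal sketch (the 2D setup, axes, and helper name are my own, purely illustrative): track each person's facing direction as a unit vector, get the right-hand side by rotating it 90° clockwise (viewed from above), and compare.

```python
import numpy as np

def right_of(facing):
    """Right-hand direction for a facing vector, viewed from above:
    rotate the facing direction 90 degrees clockwise."""
    rot = np.array([[0.0, 1.0],
                    [-1.0, 0.0]])
    return rot @ facing

# Person in the room, facing the door: call that direction +x.
observer = np.array([1.0, 0.0])

# I walk in through the door, so I'm facing into the room: -x.
me_entering = np.array([-1.0, 0.0])
# I turn around 180 degrees to walk back out: now I'm also facing +x.
me_exiting = -me_entering

# Same facing direction -> same right-hand side.
same_side = float(np.dot(right_of(observer), right_of(me_exiting))) > 0
print("my right arm is on the observer's right:", same_side)   # True
```

Both right-hand vectors come out pointing the same way, which is just the "we're both facing the door, so our rights coincide" argument; ChatGPT's quoted answer instead freezes my orientation at the moment I entered.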
My claim is more that data processing alone is not enough. I was too vague and definitely did not convey myself accurately. I tried to clarify a bit in a sibling comment to yours, but I'm still unsure it's sufficient, tbh.
For embodiment, I think it is sufficient but not necessary. A key part of the limitation is that the agent cannot interact with its environment, and interaction is a necessary feature for distinguishing competing explanations. I believe we are actually in agreement here, but I do think we need to be careful about how we define embodiment, because even a toaster can be considered a robot. It's hard to determine what does not qualify as a body once we get into the nitty-gritty. But I think in general, when people talk about embodiment, they are talking about the capability to intervene.
By your elaboration, I believe we agree, since part of what I believe to be necessary is the ability to self-analyze (metacognition) to identify low-density regions of the model, and then the ability to seek out data that rectifies them (intervention). Data processing alone is sufficient for neither condition.
Your prompt is, imo, more about world modeling, though I do think this is related. I asked Claude Sonnet 4.5 with extended thinking enabled, and it also placed itself outside the room. Opus 4.1 (again with extended thinking) got the answer right. (I don't use a standard system prompt, though mine mostly exists to make the model less sycophantic, get it to ask questions when uncertain, and enforce step-by-step thinking.)
From the perspective of the person in the room, your right arm would be on their right side as you walk out.
Here's why: When you initially walk into the room facing the person, your right arm appears on their left side (since you're facing each other). But when you turn around 180 degrees to walk back out, your back is now toward them. Your right arm stays on your right side, but from their perspective it has shifted to their right side.
Think of it this way - when two people face each other, their right sides are on opposite sides. But when one person turns their back, both people's right sides are now on the same side.
The CoT output is a bit more interesting[0]. Disabling my system prompt gives an almost identical answer fwiw. But Sonnet got it right. I repeated the test in incognito after deleting the previous prompts and it continued to get it right, independent of my system prompt or extended thinking.
I don't think this proves a world model, though. Misses are more important than hits, just as counterexamples are more important than examples in any evidence or proof setting. But fwiw, I also frequently ask these models variations on river-crossing problems, and the results are very shabby. A few appear to be spoiled (memorized) now, but they are not robust to variation, and that, I think, is critical.
I think an interesting variation of your puzzle is as follows
Imagine you walked into a room through a doorway. Then you immediately turn around and walk back out of the room.
From the perspective of a person in the room, facing the door, which side would your right arm be? Please explain.
I think Claude (Sonnet) shows something subtle but important in how it answers:
Your right arm would be on their right side.
When you turn around to walk back out, you're facing the same direction as the person in the room (both facing the door). Since you're both oriented the same way, your right side and their right side are on the same side.
This makes me suspect there's some overfitting; the CoT correctly uses "I"[1].
It definitely isn't robust to red herrings[2], and I think that's the kicker here. It matches the failure modes I see in all of these puzzles: they are quite easy to break with small variations. And we do need to remember that these models are trained on the entire internet (including HN comments), so we can't presume this is a unique puzzle.
Let me clarify. I was too vague and definitely did not express things accurately. That is on me.
We have the math to show that it can be impossible to distinguish two explanations through data processing alone. We have examples of this in science, a long history of it in fact. Fundamentally, there is so much that we cannot conclude from processing data alone. Science (the search for knowledge) is active. It doesn't just process existing data; it requires the search for new data. We propose competing hypotheses that are indistinguishable given the current data and seek out the data that distinguishes them (a pain point for many of the TOEs like String Theory). We know that data processing alone is insufficient for explanation. We know it cannot distinguish confounders. We know it cannot distinguish causal graphs (e.g., triangular maps: we can construct them, but we cannot distinguish them through data processing alone). The problem with scaling alone is that it asserts data processing is enough, yet we have so much work (and history) telling us that it isn't.
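To make the causal-graph point concrete, here's a toy construction of my own (not from any particular paper): two linear-Gaussian models with opposite arrows, X→Y and Y→X, tuned to induce the same joint distribution. No amount of processing observational samples can separate them; an intervention can.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A (X -> Y): X ~ N(0, var 1), Y = X + N(0, var 1)
xA = rng.normal(0.0, 1.0, n)
yA = xA + rng.normal(0.0, 1.0, n)

# Model B (Y -> X): Y ~ N(0, var 2), X = 0.5 * Y + N(0, var 0.5)
yB = rng.normal(0.0, np.sqrt(2.0), n)
xB = 0.5 * yB + rng.normal(0.0, np.sqrt(0.5), n)

# Observationally indistinguishable: both give covariance ~[[1, 1], [1, 2]].
print(np.cov(xA, yA))
print(np.cov(xB, yB))

# The intervention do(X := 2) separates them:
yA_do = 2.0 + rng.normal(0.0, 1.0, n)        # in A, Y still responds to X
yB_do = rng.normal(0.0, np.sqrt(2.0), n)     # in B, Y ignores what we did to X
print(yA_do.mean(), yB_do.mean())            # ~2.0 vs ~0.0
```

This is exactly the interventional capability the embodiment discussion above is pointing at.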
The scaling math itself also shows drastically diminishing returns with scale and often does not suggest convergence even with infinite data. The curves are power laws with positive concavity, requiring exponential increases in data and parameters for marginal improvements in test loss. I'm not claiming that we need zero test loss to reach AGI, but the results do tell us that if test loss is strongly correlated with AGI, then we'll need to spend exponentially more to get there even if we are close. By our own measures, scaling is not enough unless we are already sufficiently close. Even our empirical results align with this: despite many claiming that scale is all we need, we keep making significant changes to model architectures and training procedures (including optimizers). We make these large changes because throwing the new data at the old models (even while simply increasing the number of parameters) does not work out. It is not just impractical; the results aren't there. The scaling claim has always been a myth used to drive investment, since it is a nice, simple story that says we can get there by doing more of what we've already been doing. We all know that these new LLMs aren't dramatic improvements off their previous versions, despite being much larger, more efficient, and having processed far more data.
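To put rough numbers on what those curves imply (the parametric form and exponent below are assumptions, chosen to be in the ballpark of published scaling-law fits, not a quote of any particular paper):

```python
# Assume a Chinchilla-style fit L(N) = E + A * N**(-alpha), where E is the
# irreducible loss. The exponent is an illustrative assumption, roughly in the
# range reported in the scaling-law literature.
alpha = 0.3

# To halve the *reducible* part of the loss, N has to grow by this factor:
factor = 2 ** (1 / alpha)
print(f"~{factor:.0f}x more scale per halving of reducible loss")
# ~10x: each further halving costs another order of magnitude, and repeated
# halvings compound multiplicatively. Whether you read that as "polynomial"
# or "effectively exponential" is the crux of the disagreement below.
```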
[side note]: We even have my namesake, who would argue that there are truths that are not provably true within a system that is both consistent and effective (effectively calculable). But we need not go that far, as omniscience is not a requirement for AGI. It is still worth noting for the limits of our models, since at the core this matters: changing our axioms changes the results, even with the same data. And science doesn't exclusively use a formal system, nor does it use a single one.
My apologies for the much delayed reply as I have recently found myself with little extra time to post adequate responses. Your critiques are very interesting to ponder, so I thank you for posting them. I did want to respond to this one though.
I believe all of my counterarguments center on my current viewpoint that, given the rapid rate of progress on the engineering side, it is no longer reasonable in deep learning theory to consider what is possible; it is more interesting to try to outline hard limitations. This places deep learning in stark contrast to classical statistics, where the boundaries are very clear, and those boundaries are not shared by deep learning.
I want to stress that, at present, nearly every limitation of deep learning conjectured over the last several decades has fallen. This includes many back-of-the-napkin, "clearly obvious" arguments, so I'm wary of them now. I think the skepticism has all along been fueled by reaction to hype cycles, and we must be careful not to make the same mistakes. There is far too much empirical evidence countering the precise arguments against an underlying understanding within these models, so it seems we must resort to imprecise ones to continue the debate.
Scaling along one axis suggests a high polynomial degree of additional compute (not exponential) is required for continued improvement; this is true. But the progress over the last few years has come from discovering new axes to scale on, which further reduce the error rate and improve performance, and there are still many potential axes left untapped. What is significant about scaling, to me, is not how much additional compute is required, but that the predicted floor at the moment is very, very low, far lower than anything we have ever seen, and reaching it doesn't require any more data than we currently have. That should be cause for concern until we find a better lower bound.
> We all know that these new LLMs aren't dramatic improvements off their previous versions
No, I don't agree. This may be evident to many, but to some, the differences are stark. Our perceived metrics of performance are nonlinear and person-dependent, and these major differences can be imperceptible to most. The vast majority of attempts at providing more regular metrics or benchmarks that are not already saturated have shown that LLM development is not slowing down by any stretch. I'm not saying that LLMs will "go to the moon". But I don't have anything concrete to say they cannot either.
> We have the math to show that it can be impossible to distinguish two explanations through data processing alone.
Actually, this is a really great point, but I think it highlights the limitations of benchmarks and the need for capacity-based, compression-based, or other data-independent metrics. With those in tow, it can become possible to distinguish two explanations. This could be a fruitful line of inquiry.
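As a rough illustration of what I mean by a compression-based comparison (a crude two-part-code / BIC-style score of my own, purely a sketch): two models that fit the observed data about equally well can still be separated by how cheaply they encode it, without collecting any new data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(-1.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.3, n)   # data actually generated by a line

def two_part_code_bits(x, y, degree):
    """Crude two-part code for a polynomial 'explanation' of the data:
    bits to state the parameters plus bits to encode the residuals
    (Gaussian codelength, up to constants). Lower is better."""
    k = degree + 1
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return 0.5 * k * np.log2(n) + 0.5 * n * np.log2(rss / n)

for degree in (1, 5):
    print(degree, round(two_part_code_bits(x, y, degree), 1))
# Both polynomials track the data, but the degree-1 "explanation" earns a
# shorter total codelength: the extra parameters of the degree-5 fit buy
# less than they cost.
```

It's obviously a toy, and it doesn't touch the interventional problem you raised, but it shows the flavor: the tiebreaker comes from the capacity/complexity term, not from gathering new observations.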