What they are testing for is reasoning performance, so it makes sense not to give the model tool access.
This is the same as the critiques of the LLM paper by Apple, where they showed that LLMs fail to solve the Tower of Hanoi problem beyond a certain number of disks. The test was to see how well these models can reason through a long task. People online said the models could solve that problem if they had access to a coding environment. Again, the test was to check reasoning capability, not whether the model knew how to code an algorithm to solve the problem.
If model performance degrades a lot after a certain number of reasoning steps, it's good to know where the limits are. Whether the model had access to tools or not is orthogonal to this problem.
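To make the "long task" point concrete, here is a rough sketch (mine, not from the paper) of the standard recursive Tower of Hanoi solution in Python. The function name `hanoi_moves` and the peg labels are just illustrative. The code itself is tiny, but the number of moves it produces is 2^n - 1 for n disks, so the length of the step-by-step solution doubles with every disk added.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves that solve an n-disk puzzle."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk directly to the target peg.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the target peg.
    yield from hanoi_moves(n - 1, spare, target, source)


if __name__ == "__main__":
    for n in (3, 10, 20):
        count = sum(1 for _ in hanoi_moves(n))
        print(n, count)  # count equals 2**n - 1
```

Writing these few lines is a very different skill from carrying out a million moves in order without slipping, which is the kind of thing a no-tools test is trying to measure.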