True. See the comment from one of Anthropic's researchers for a great example of that. It's likely that "planning" inherently exists in the raw LLM and RL is just bringing it to the forefront.
I just think it's helpful to understand that all of these models people are interacting with were trained with the _explicit_ goal of maximizing the probabilities of responses _as a whole_, not just maximizing probabilities of individual tokens.
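To make the distinction concrete, here's a minimal sketch (all numbers hypothetical, using a REINFORCE-style objective as a stand-in for whatever RL method was actually used). Pretraining scores each ground-truth token independently, while RL fine-tuning attaches a single scalar reward to the entire sampled response, so every token's log-probability gets pushed up or down together:

```python
import math

# Hypothetical per-token probabilities the model assigns to one response.
token_probs = [0.9, 0.2, 0.8, 0.7]

# Pretraining: next-token prediction sums per-token negative
# log-likelihoods against fixed ground-truth tokens -- each token
# is scored on its own.
pretrain_loss = sum(-math.log(p) for p in token_probs)

# Algebraically this equals the NLL of the response as a whole,
# since log p(response) = sum of log p(token_i | prefix)...
response_prob = math.prod(token_probs)
assert math.isclose(pretrain_loss, -math.log(response_prob))

# ...but the training signal differs. An RL objective (REINFORCE-style
# sketch) weights log p(response) by one scalar reward for the whole
# sampled response; no per-token targets exist, so the response
# succeeds or fails as a unit.
reward = 1.0  # e.g. a hypothetical preference-model score
reinforce_objective = reward * math.log(response_prob)
```

The point being: per-token and whole-response log-likelihoods are the same quantity mathematically, but RL credits tokens collectively based on how the full response turned out, rather than matching each token to a fixed target.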