Non-thinking/agentic models have to 1-shot the answer, so every token they output is part of the response, even if it's wrong.
This is why people are getting different results with thinking models -- without a thinking phase, it's as if you could be asked ANY question and had to give the correct answer all at once, full stream-of-consciousness.
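A toy sketch of the difference (purely illustrative -- the `</think>` marker and function names here are hypothetical, not any vendor's actual API): in the non-thinking case every generated token lands in the visible answer, while in the thinking case everything before the end-of-thought marker is scratchpad and gets dropped.

```python
def respond_non_thinking(tokens):
    # Every generated token is part of the final answer -- no take-backs.
    return "".join(tokens)

def respond_thinking(tokens, end_of_thought="</think>"):
    # Tokens before the end-of-thought marker are private reasoning and are
    # discarded; only what comes after is surfaced as the answer.
    text = "".join(tokens)
    _, sep, answer = text.partition(end_of_thought)
    return answer.strip() if sep else text

stream = ["Hmm, 17 * 24... ", "that's 408. ", "</think>", "408"]
print(respond_non_thinking(stream))  # the stream-of-consciousness leaks into the answer
print(respond_thinking(stream))      # just "408"
```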
Yes, there are perverse incentives, but I wonder why these sorts of models are available at all, tbh.