From my perspective as a potential user the number of attempts is the number of ...

mcintyre1994 · 2025-08-12T13:20:46 1755004846

I think it depends on "But the top rated submissions aren’t running production products" It sounds like they're shipping a product without the debug agent/try-again logic, and that's just for the benchmark, so you wouldn't get the performance they get as a user.

radarsat1 · 2025-08-12T14:30:14 1755009014

I was thinking about this recently with respect to how many agent systems now let you specify a smaller/faster model for easier tasks and a bigger model for harder tasks.

It's interesting to think about what the trade-offs are. Assuming the system can properly classify a task as easy or hard (big "if" but I guess there are ways), there is nonetheless more to think about, depending on your pricing plan.

For subscription pricing, I guess you don't really care which model runs and in fact it's hard to find a reason to ever run the smaller model, so choosing between the models is more in the provider's interests for cost efficiency.

But for pay-per-use pricing, But if you have a bigger model that can get the answer right 80% of the time, and a smaller model that can handle smaller changes and get things right 60% of the time but correct its mistakes, then the system should try to run it on as many tasks as possible to save you money.. but in the end if ends up having to make a lot of corrections, then maybe you end up needing more total requests than the larger model. In that case maybe it's actually cheaper to run the larger model, if it takes fewer requests.

So I wonder how that kind of trade-off could be effectively calculated. I guess if you can figure out when "retries" happen you can count them and do some statistics on which model is more likely to work out in fewer shots. It's pretty complicated though, when you start to think about it in detail.

I do wonder if even having BOTH the smaller and bigger model make hypotheses, and try the smaller model's idea first, then if it fails, try the bigger model's idea, might be the way to go.

gronky_ · 2025-08-12T13:29:29 1755005369

This ok from your perspective then?

def make_pass@1_agent(agent, n):

    def retry_agent(problem):

        for attempt in range(n):

            result = agent(problem)

            if result.success:

                return result

        return result

    return retry_agent

terminalshort · 2025-08-12T15:08:38 1755011318

Definitely wouldn't have written the code that way, but yes, if (and this is a massive "if") the agent has an accurate and meaningful way to determine which way to set the success boolean. The obvious caveat would be if n needed to be large enough to set the costs higher than I am willing to pay for the additional performance or it makes it take longer than I'm willing to wait.

Think of the agent like an employee. If he delivers the code within the expected time and to the expected quality standards, his process of getting there means almost nothing. Do I care if he tried 4 different approaches along the way and threw out the first 3? Not a bit.

DougBTX · 2025-08-12T13:49:32 1755006572

Absolutely fine, as long as the success flag is predicted by the model ensemble under test. That’s how Claude Code works for example, it will continue to iterate until success (or it will give up with failure at a certain point).

gronky_ · 2025-08-12T13:37:21 1755005841

Keep in mind that this isn’t about users - the top agents on the leaderboard aren’t running an actual product on the benchmark.

If they are running their production product as is, then of course whatever is built into the product is fine.

whymauri · 2025-08-12T14:18:25 1755008305

Papers have been doing rollouts that involve a model proposing N solutions and then self-reviewing to choose the best one (prior to the verifier). So far, I think that's been counted as one pass.