I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.
71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take for example Refact which is at #2 with 74.4%, they built a 2k lines of code framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent which tries again, so it’s effectively multiple attempts per problem.
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
Right. Building a custom setup is blatant- that will wildly overfit.
But let's say a group uses it as a metric as part of CI and each new idea / feature they create runs against SWE bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidates datasets for fine tuning, maybe they're choosing between checkpoints.
This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.
IIRC the SWE bench dataset gives you the full repo snapshot + the issue text, the evaluation pipelines typically run some kind of retriever (eg. grep, BM25) to pick a subset of files to place in the model’s context. They provided context is usually limited up to ~50k tokens.
Is there something in this multi-agent approach that makes the setup more specific to just the test at hand and less general to real engineering tasks? If not, then this multi-agent system will just become what you get out of the box in a future product. Multiple attempts per problem (as long as there's no human intervention or selection between them) is a perfectly fine approach for agents because that's not an issue from the perspective of an engineer using the product. A single agent is already a multi-step usage of LLMs and it sounds like this is just another meta level of that.
From my perspective as a potential user the number of attempts is the number of times I have to tell it what to do. If you have an agent that makes a single attempt and is 60% accurate vs another that makes 5 attempts and is 80% accurate, why would you care that each individual attempt of the 2nd model is less accurate than the first?
I think it depends on "But the top rated submissions aren’t running production products" It sounds like they're shipping a product without the debug agent/try-again logic, and that's just for the benchmark, so you wouldn't get the performance they get as a user.
I was thinking about this recently with respect to how many agent systems now let you specify a smaller/faster model for easier tasks and a bigger model for harder tasks.
It's interesting to think about what the trade-offs are. Assuming the system can properly classify a task as easy or hard (big "if" but I guess there are ways), there is nonetheless more to think about, depending on your pricing plan.
For subscription pricing, I guess you don't really care which model runs and in fact it's hard to find a reason to ever run the smaller model, so choosing between the models is more in the provider's interests for cost efficiency.
But for pay-per-use pricing, But if you have a bigger model that can get the answer right 80% of the time, and a smaller model that can handle smaller changes and get things right 60% of the time but correct its mistakes, then the system should try to run it on as many tasks as possible to save you money.. but in the end if ends up having to make a lot of corrections, then maybe you end up needing more total requests than the larger model. In that case maybe it's actually cheaper to run the larger model, if it takes fewer requests.
So I wonder how that kind of trade-off could be effectively calculated. I guess if you can figure out when "retries" happen you can count them and do some statistics on which model is more likely to work out in fewer shots. It's pretty complicated though, when you start to think about it in detail.
I do wonder if even having BOTH the smaller and bigger model make hypotheses, and try the smaller model's idea first, then if it fails, try the bigger model's idea, might be the way to go.
Definitely wouldn't have written the code that way, but yes, if (and this is a massive "if") the agent has an accurate and meaningful way to determine which way to set the success boolean. The obvious caveat would be if n needed to be large enough to set the costs higher than I am willing to pay for the additional performance or it makes it take longer than I'm willing to wait.
Think of the agent like an employee. If he delivers the code within the expected time and to the expected quality standards, his process of getting there means almost nothing. Do I care if he tried 4 different approaches along the way and threw out the first 3? Not a bit.
Absolutely fine, as long as the success flag is predicted by the model ensemble under test. That’s how Claude Code works for example, it will continue to iterate until success (or it will give up with failure at a certain point).
Papers have been doing rollouts that involve a model proposing N solutions and then self-reviewing to choose the best one (prior to the verifier). So far, I think that's been counted as one pass.
We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something at least we can get some peer reviews
I’m working on this at STAC Research and looking to connect with others interested in helping. Key challenges are ensuring impartiality (and keeping it that way), making benchmarks ungameable, and guaranteeing reproducibility. We’ve done similar work in finance and are now applying the same principles to AI.
Sure! STAC Research has been building and running benchmarks in finance for ~18 years. We’ve had to solve many of the same problems I think you’re highlighting here.. e.g. tech & model providers tuning specifically for the benchmark, results that get published but can’t be reproduced outside the provider’s lab, etc.
The approach is to use workloads defined by developers and end users (not providers) that reflect their real-world tasks. E.g. in finance, delivering market snapshots to trading engines. We test full stacks, holding some layers constant so you can isolate the effect of hardware, software, or models. Every run goes through an independent third-party audit to ensure consistent conditions, no cherry-picking of results, and full disclosure of config and tuning, so that the results are reproducible and the comparisons are fair.
In finance, the benchmarks are trusted enough to drive major infrastructure decisions by the leading banks and hedge funds, and in some cases to inform regulatory discussions, e.g. around how the industry handles time synchronization.
Now starting to apply the same principles to the AI benchmarking space. Would love to talk to anyone who wants to be involved?
I think those wrappers could create some potentially complex workflow around LLM API, with various trees of decisions, integrations, eval, rankers, ratets, etc, and this is their added value.
I've been using Warp for the past few weeks and it's been incredibly impressive over other agentic coding services/platforms. Curious how Qodo stacks up.
When I tried warp I was convinced that was where the industry was going (agents as terminal replacement), but it felt a bit too heavy to me so I haven’t been using it lately. Still think all things will converge on terminal and browser replacement.
I feel like the bash only SWE Bench Verified (a.k.a model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.
There's swe-rebench, where they take "bugs/issues" by date, and you can drag a slider on their top scores to see issues solved after the model was released (obviously only truly working for open models).
If it's really better than Claude Code while using Sonnet 4.0, then I'd pay a monthly fee for it, but only if I can use my Claude subscription the same way Claude Code does.
I do not want to pay API charges or be limited to a fixed number of "credits" per month.
Slick. This applies to the new Qodo Command CLI, yes?
I updated to the latest version last night. Enjoyed seeing the process permission toggle (rwx). Was a refreshing change to keep the security minded folks less in panic with all the agentic coding adoptions :-)
I would be more interested in Qodo's performance on the swe-bench-multilingual benchmark. Swe-bench-verified only includes bugs related to python breakages.
The best submission is swe-bench-multilingual is Claude 3.7 Sonnet which solves ~43% of the issues in the dataset.
Does anyone have a benchmark on the effectiveness of using embeddings for mapping bug reports to code files as opposed to extensive grepping as Qodo, Cursor and a number of tools I use do to localize faults?
If Qodo are reading this: please introduce a plan that isn't for teams or enterprise. A "pro" plan for individuals who want more than 250 credits per month.
71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take for example Refact which is at #2 with 74.4%, they built a 2k lines of code framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent which tries again, so it’s effectively multiple attempts per problem.
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.