Hacker News | nyellin's comments

We publish the benchmarks for HolmesGPT (CNCF sandbox project) at https://holmesgpt.dev/development/evaluations/

HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers, including Fortune 500 companies using SRE agents in incredibly complex production environments.

We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet even wins sometimes.

[1] https://holmesgpt.dev/development/evaluations/history/


Haiku is called often, but not always the way you think. E.g. every time you write something, Claude Code invokes Haiku multiple times to generate the 'delightful 1-2 word phrase used to indicate progress to the user' (Doing Stuff, Wizarding, etc.).


It's also used in the Explore agent, among other things.


Not necessarily true. Subagents allow for parallelization, but they can decrease accuracy dramatically if you're not careful: there are often dependencies between tasks, and swapping context windows with a summary is extremely lossy.

For the longest time, Claude Code itself didn't really use subagents much by default, other than supporting them as a feature eager users could configure. (Source: reverse engineering we did on Claude Code using the fantastic CC tracing tool Simon Willison wrote about once. This is also no longer true on the latest versions, which have e.g. an Explore subagent that is actively used.)
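
To make the lossiness concrete, here's a rough sketch of the usual parent/subagent handoff. This is not Claude Code's actual implementation; call_llm and summarize are stand-ins for your own LLM client and summarization step:

    # Rough sketch of a parent/subagent handoff (hypothetical, not Claude Code's code).
    def run_subagent(task: str, call_llm, summarize) -> str:
        # Fresh context: the subagent sees only its task, not the parent's history.
        messages = [
            {"role": "system", "content": "You are a focused sub-agent. Do the task, then report back."},
            {"role": "user", "content": task},
        ]
        result = call_llm(messages)
        # Only a short summary crosses back into the parent's context window.
        # Everything else the subagent read or inferred is dropped, which is where
        # accuracy goes when later tasks depend on those details.
        return summarize(result)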


You're right that subagents were more likely to cause issues than be helpful. But, when properly understood, they lead to a lot of time saved through parallelization for tasks that warrant it.

I was having Codex organize my TV/movie library the other day. Most of the files were not properly labeled, so I had Codex generate transcripts, manually search the movie db to find descriptions of show episodes, and match the show descriptions against the transcripts to figure out which episode/season each file belonged to.

Claude Code could have parallelized those manual checks and finished that task at 8x the speed.
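
To be fair, that particular fan-out doesn't even need subagents. A hypothetical sketch with a plain thread pool, where match_episode is a stand-in for the "search the movie db and compare against the transcript" step:

    # Hypothetical sketch of parallelizing the per-file checks.
    from concurrent.futures import ThreadPoolExecutor

    def match_episode(transcript_path: str) -> str:
        # Stand-in: search the movie db, compare episode descriptions against
        # the transcript, and return a label like "S02E05".
        return "S01E01"

    transcripts = ["file1.txt", "file2.txt", "file3.txt"]  # made-up file names

    # The lookups are I/O-bound, so ~8 workers gives roughly that 8x speedup.
    with ThreadPoolExecutor(max_workers=8) as pool:
        labels = list(pool.map(match_episode, transcripts))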


Forgot to address the easiest part:

> - how can I reliably call tools with the right schema?

This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema. (I.e. the LLM samples only from the subset of tokens that lead to valid schema output.)
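
For example, with OpenAI's Chat Completions API it's roughly one flag on the tool definition. The get_weather tool and its schema below are made up for illustration:

    # Strict tool calling, OpenAI Chat Completions style (example tool is made up).
    # With strict=True the model can only emit arguments that validate against the schema.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city", "unit"],
                "additionalProperties": False,  # required when strict is enabled
            },
        },
    }]

(Most self-hosted stacks have an equivalent, e.g. grammar-based constrained decoding in llama.cpp or guided decoding in vLLM.)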


Re (1): use a TODO system like Claude Code does.

Re (2) also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...

Or just use the Claude Code SDK, which does all this for you! (You can also use various provider-specific features for (2), like automatic compaction on OpenAI's Responses endpoint.)
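
If you do roll (2) yourself, the wiring around that summarization prompt is roughly: once the conversation nears the context limit, summarize the older messages and splice the summary back in. A rough sketch with made-up helper names (count_tokens and summarize_llm are stand-ins for your tokenizer and the summarization call):

    # Rough sketch of context compaction (helper functions are placeholders).
    CONTEXT_LIMIT = 200_000   # model context window, in tokens
    COMPACT_AT = 0.8          # compact once ~80% of the window is used
    KEEP_RECENT = 10          # always keep the most recent messages verbatim

    def maybe_compact(messages, count_tokens, summarize_llm):
        if count_tokens(messages) < CONTEXT_LIMIT * COMPACT_AT:
            return messages
        system, old, recent = messages[0], messages[1:-KEEP_RECENT], messages[-KEEP_RECENT:]
        summary = summarize_llm(old)  # the summarization prompt linked above
        return [system,
                {"role": "user", "content": f"Summary of earlier conversation:\n{summary}"},
                *recent]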


There's a bit more to it!

For example, the agent in the post will demonstrate 'early stopping', where it finishes before the task is really done. You'd think you could solve this with reasoning models, but it doesn't actually work, even on SOTA models.

To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
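
The mechanism itself is tiny. A hypothetical sketch of "inject the open TODOs back into every prompt" - run_llm_with_tools and the completed-TODO parsing are placeholders, not Claude Code's or HolmesGPT's actual code:

    # Hypothetical sketch of TODO re-injection in an agent loop.
    def build_prompt(task: str, todos: list[str]) -> str:
        open_items = "\n".join(f"- [ ] {t}" for t in todos)
        # Reminding the model what's still open on every turn is what stops it
        # from declaring victory after the first sub-task.
        return f"{task}\n\nOpen TODOs (do not stop until all are done):\n{open_items}"

    def agent_loop(task: str, todos: list[str], run_llm_with_tools):
        while todos:
            response = run_llm_with_tools(build_prompt(task, todos))
            # Assume the harness extracts which TODOs the model marked as done.
            todos = [t for t in todos if t not in response["completed_todos"]]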

Still, good article. Agents really are just tools in a loop. It's not rocket science.


Yes, this “premature termination” becomes particularly evident when you switch out Opus/Sonnet for a weaker LLM, and it also happens more often in Codex CLI with GPT-5.

Since one of the replies asked for an example: the agent works for a bit and just stops. We’ve all seen cases where the agent simply says “ok, let me read blah.py to understand the context better”, and then just stops. It has essentially forgotten to use a tool for its next edit or read, etc.


why would it early stop? examples?


Models just naturally arrive at the conclusion that they are done. TODO hints can help, but are not infallible: Claude will stop and happily report there's more work to be done and "you just say the word Mister and I'll continue". This is an RL problem where you have to balance the chance of an infinite loop (it keeps thinking there's a little bit more to do when there is not) against the opposite, where it stops short of actual completion.


> This is an RL problem where you have to balance the chance of an infinite loop (it keeps thinking there's a little bit more to do when there is not) against the opposite, where it stops short of actual completion.

Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?

I can think of a pet theory on it stopping early -- that positive tool responses and such bias it towards thinking it's complete (could be extremely wrong)


My pet theory: LLMs are good at detecting and continuing patterns. Repeating the same thing is a rather simple pattern, and there's no obvious place to stop if an LLM falls into that pattern unintentionally. At least to an unsophisticated LLM, the most likely completion is to continue the pattern.

So infinite loops are more of a default, and the question is how to avoid them. Picking randomly (non-zero temperature) helps prevent repetition sometimes. Other higher-level patterns probably prevent this from happening most of the time in more sophisticated LLMs.


> Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?

Who said anything about "thinking"? Smaller models were notorious for getting stuck repeating a single word over and over, or just "eeeeeee" forever. Larger models only change probabilities, not the fundamental nature of the machine.


Not all models are trained for long one-shot task-following on their own; many of them seem to prefer closer interaction with the user. You could always add another layer/abstraction above/below to work around it.


Can't this just be a Ralph Wiggum loop (i.e. while True)?


Sure, but I think just about everyone wants the agent to eventually say "done" in one way or another.
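
That "done" check is the whole problem. A toy sketch of the while-True version, where agent_step is a placeholder for one full tool-calling turn of whatever agent you're running:

    # Toy "Ralph Wiggum" outer loop (agent_step is a stand-in, not a real API).
    def ralph_wiggum_loop(agent_step, max_turns: int = 50) -> None:
        for _ in range(max_turns):  # cap it so a confused agent can't spin forever
            result = agent_step("Continue the task. Reply DONE only when nothing is left.")
            if "DONE" in result:
                break  # we still have to trust the model's own claim that it's finished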


I know there are already a number of comments here about proprietary solutions.

If you're looking for something open source: https://github.com/robusta-dev/holmesgpt/


We've open sourced something with similar goals that you can use today: https://github.com/robusta-dev/holmesgpt/

We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.

What we've released really shines at automatically surfacing relevant observability data, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.

If anyone is curious, I did a webinar with PagerDuty on this recently.



Can we see the recording of this webinar somewhere?



Robusta.dev | REMOTE (EUROPE) or ONSITE (ISRAEL) | Staff Software Engineer, Backend Team Lead

We investigate cloud alerts with LLMs - see http://github.com/robusta-dev/holmesgpt/

Email natan at our domain


Chatted with these folks a few times and they are all lovely people and would be super fun to work with.


Thank you!

