Hacker News | nyellin's comments

We publish the benchmarks for HolmesGPT (CNCF sandbox project) at https://holmesgpt.dev/development/evaluations/

HolmesGPT maintainer here: our benchmarks [1] tell a very different story, as does anecdotal evidence from our customers, including Fortune 500 companies using SRE agents in incredibly complex production environments.

We're actually struggling a bit with benchmark saturation right now. Opus does much better in the real world than Sonnet, but it's hard to create sophisticated enough benchmarks to show that in the lab. When we run benchmarks with a small number of iterations, Sonnet even wins sometimes.

[1] https://holmesgpt.dev/development/evaluations/history/


Haiku is called often, but not always the way you think. E.g. every time you write something, Claude Code invokes Haiku multiple times to generate the 'delightful 1-2 word phrase used to indicate progress to the user' (Doing Stuff, Wizarding, etc.).


It's also used in the Explore agent, among other things.


Not necessarily true. Subagents allow for parallelization, but they can decrease accuracy dramatically if you're not careful: there are often dependencies between tasks, and swapping context windows with a summary is extremely lossy.

For the longest time, Claude Code itself didn't really use subagents much by default, other than supporting them as a feature eager users could configure. (Source: reverse engineering we did on Claude Code using the fantastic CC tracing tool Simon Willison wrote about once. This is also no longer true on the latest versions, which have e.g. an Explore subagent that is actively used.)
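
To make the lossiness concrete, here's a rough sketch of the usual parent/subagent handoff. This is not Claude Code's actual implementation; call_llm and summarize are stand-ins for your own LLM client and summarization step:

    # Rough sketch of a parent/subagent handoff (hypothetical, not Claude Code's code).
    def run_subagent(task: str, call_llm, summarize) -> str:
        # Fresh context: the subagent sees only its task, not the parent's history.
        messages = [
            {"role": "system", "content": "You are a focused sub-agent. Do the task, then report back."},
            {"role": "user", "content": task},
        ]
        result = call_llm(messages)
        # Only a short summary crosses back into the parent's context window.
        # Everything else the subagent read or inferred is dropped, which is where
        # accuracy goes when later tasks depend on those details.
        return summarize(result)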


You're right that subagents were more likely to cause issues than be helpful. But, when properly understood, they lead to a lot of time saved through parallelization for tasks that warrant it.

I was having Codex organize my TV/movie library the other day. Most of the files were not properly labeled, so I had Codex generate transcripts, manually search the movie db to find descriptions of show episodes, and match the show descriptions against the transcripts to figure out which episode/season each file belonged to.

Claude Code could have parallelized those manual checks and finished that task at 8x the speed.
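
To be fair, that particular fan-out doesn't even need subagents. A hypothetical sketch with a plain thread pool, where match_episode is a stand-in for the "search the movie db and compare against the transcript" step:

    # Hypothetical sketch of parallelizing the per-file checks.
    from concurrent.futures import ThreadPoolExecutor

    def match_episode(transcript_path: str) -> str:
        # Stand-in: search the movie db, compare episode descriptions against
        # the transcript, and return a label like "S02E05".
        return "S01E01"

    transcripts = ["file1.txt", "file2.txt", "file3.txt"]  # made-up file names

    # The lookups are I/O-bound, so ~8 workers gives roughly that 8x speedup.
    with ThreadPoolExecutor(max_workers=8) as pool:
        labels = list(pool.map(match_episode, transcripts))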


Forgot to address the easiest part:

> - how can I reliably call tools with the right schema?

This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema. (I.e. the LLM samples only from the subset of tokens that lead to valid schema output.)
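
For example, with OpenAI's Chat Completions API it's roughly one flag on the tool definition. The get_weather tool and its schema below are made up for illustration:

    # Strict tool calling, OpenAI Chat Completions style (example tool is made up).
    # With strict=True the model can only emit arguments that validate against the schema.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city", "unit"],
                "additionalProperties": False,  # required when strict is enabled
            },
        },
    }]

(Most self-hosted stacks have an equivalent, e.g. grammar-based constrained decoding in llama.cpp or guided decoding in vLLM.)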


Re (1): use a TODO system like Claude Code does.

Re (2) also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...

Or just use the Claude Code SDK, which does all this for you! (You can also use various provider-specific features for (2), like automatic compaction on OpenAI's Responses endpoint.)
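
If you do roll (2) yourself, the wiring around that summarization prompt is roughly: once the conversation nears the context limit, summarize the older messages and splice the summary back in. A rough sketch with made-up helper names (count_tokens and summarize_llm are stand-ins for your tokenizer and the summarization call):

    # Rough sketch of context compaction (helper functions are placeholders).
    CONTEXT_LIMIT = 200_000   # model context window, in tokens
    COMPACT_AT = 0.8          # compact once ~80% of the window is used
    KEEP_RECENT = 10          # always keep the most recent messages verbatim

    def maybe_compact(messages, count_tokens, summarize_llm):
        if count_tokens(messages) < CONTEXT_LIMIT * COMPACT_AT:
            return messages
        system, old, recent = messages[0], messages[1:-KEEP_RECENT], messages[-KEEP_RECENT:]
        summary = summarize_llm(old)  # the summarization prompt linked above
        return [system,
                {"role": "user", "content": f"Summary of earlier conversation:\n{summary}"},
                *recent]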


There's a bit more to it!

For example, the agent in the post will demonstrate 'early stopping', where it finishes before the task is really done. You'd think you could solve this with reasoning models, but it doesn't actually work, even on SOTA models.

To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious, somewhere in the public repo for HolmesGPT we have benchmarks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)
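
The mechanism itself is tiny. A hypothetical sketch of "inject the open TODOs back into every prompt" - run_llm_with_tools and the completed-TODO parsing are placeholders, not Claude Code's or HolmesGPT's actual code:

    # Hypothetical sketch of TODO re-injection in an agent loop.
    def build_prompt(task: str, todos: list[str]) -> str:
        open_items = "\n".join(f"- [ ] {t}" for t in todos)
        # Reminding the model what's still open on every turn is what stops it
        # from declaring victory after the first sub-task.
        return f"{task}\n\nOpen TODOs (do not stop until all are done):\n{open_items}"

    def agent_loop(task: str, todos: list[str], run_llm_with_tools):
        while todos:
            response = run_llm_with_tools(build_prompt(task, todos))
            # Assume the harness extracts which TODOs the model marked as done.
            todos = [t for t in todos if t not in response["completed_todos"]]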

Still, good article. Agents really are just tools in a loop. It's not rocket science.


Yes, this “premature termination” becomes particularly evident when you switch out Opus/Sonnet for a weaker LLM, and it also happens more often in Codex CLI with GPT-5.

Since one of the replies asked for an example: the agent works for a bit and just stops. We’ve all seen cases where the agent simply says “ok, let me read blah.py to understand the context better”, and then just stops. It has essentially forgotten to use a tool for its next edit or read, etc.


why would it early stop? examples?


Models just naturally arrive at the conclusion that they are done. TODO hints can help, but are not infallible: Claude will stop and happily report there's more work to be done and "you just say the word Mister and I'll continue". This is an RL problem where you have to balance the chance of an infinite loop (it keeps thinking there's a little bit more to do when there is not) against the opposite, where it stops short of actual completion.


> This is an RL problem where you have to balance the chance of an infinite loop (it keeps thinking there's a little bit more to do when there is not) against the opposite, where it stops short of actual completion.

Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?

I can think of a pet theory on it stopping early -- that positive tool responses and such bias it towards thinking it's complete (could be extremely wrong)


My pet theory: LLMs are good at detecting and continuing patterns. Repeating the same thing is a rather simple pattern, and there's no obvious place to stop if an LLM falls into that pattern unintentionally. At least to an unsophisticated LLM, the most likely completion is to continue the pattern.

So infinite loops are more of a default, and the question is how to avoid them. Picking randomly (non-zero temperature) helps prevent repetition sometimes. Other higher-level patterns probably prevent this from happening most of the time in more sophisticated LLMs.


> Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?

Who said anything about "thinking"? Smaller models were notorious for getting stuck repeating a single word over and over, or just "eeeeeee" forever. Larger models only change probabilities, not the fundamental nature of the machine.


Not all models are trained for long one-shot task-following on their own; many of them seem to prefer closer interaction with the user. You could always add another layer/abstraction above/below to work around it.


Can't this just be a Ralph Wiggum loop (i.e. while True)?


Sure, but I think just about everyone wants the agent to eventually say "done" in one way or another.
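
That "done" check is the whole problem. A toy sketch of the while-True version, where agent_step is a placeholder for one full tool-calling turn of whatever agent you're running:

    # Toy "Ralph Wiggum" outer loop (agent_step is a stand-in, not a real API).
    def ralph_wiggum_loop(agent_step, max_turns: int = 50) -> None:
        for _ in range(max_turns):  # cap it so a confused agent can't spin forever
            result = agent_step("Continue the task. Reply DONE only when nothing is left.")
            if "DONE" in result:
                break  # we still have to trust the model's own claim that it's finished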


I know there are already a number of comments here about proprietary solutions.

If you're looking for something open source: https://github.com/robusta-dev/holmesgpt/


We've open sourced something with similar goals that you can use today: https://github.com/robusta-dev/holmesgpt/

We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.

What we've released really shines at automatically surfacing relevant observability data, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.

If anyone is curious, I did a webinar with PagerDuty on this recently.



Can we see the recording of this webinar somewhere?



Robusta.dev | REMOTE (EUROPE) or ONSITE (ISRAEL) | Staff Software Engineer, Backend Team Lead

We investigate cloud alerts with LLMs - see http://github.com/robusta-dev/holmesgpt/

Email natan at our domain


Chatted with these folks a few times and they are all lovely people and would be super fun to work with.


Thank you!

