Hacker News: ctoth's comments

For each paper, have your agent extract a three-sentence description into a description.md, then concat those with the paper names into an INDEX.md which it should consult to find appropriate papers. Also: have your agent tag papers, then autogenerate your tagged collection on the filesystem. Then you get nice things like https://github.com/ctoth/Qlatt/tree/master/papers/tagged
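As a sketch of that indexing step, assuming a hypothetical layout of one directory per paper containing a description.md (the actual repo layout may differ):

```python
from pathlib import Path

def build_index(papers_dir: Path) -> str:
    """Concatenate each paper's description.md into one INDEX.md body."""
    lines = ["# Paper Index\n"]
    for paper in sorted(p for p in papers_dir.iterdir() if p.is_dir()):
        desc = paper / "description.md"
        if desc.exists():
            lines.append(f"## {paper.name}\n{desc.read_text().strip()}\n")
    return "\n".join(lines)

def link_tags(papers_dir: Path, tags: dict[str, list[str]]) -> None:
    """Mirror the collection under papers/tagged/<tag>/ via symlinks."""
    tagged = papers_dir / "tagged"
    for tag, names in tags.items():
        tag_dir = tagged / tag
        tag_dir.mkdir(parents=True, exist_ok=True)
        for name in names:
            link = tag_dir / name
            if not link.exists():
                link.symlink_to(papers_dir / name)
```

The symlink mirror means a paper can live under several tags without duplicating anything on disk.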

Then something in your {CLAUDE,AGENTS}.md that says: when working on something with relevant context supplied by papers, read the papers before doing the work. You can find all papers plus their descriptions in ./papers/INDEX.md and papers by tag in ./papers/tagged


I've been working on ctoth/research-papers-plugin, the pipeline to actually get LLMs to extract the notes. I really like your insight re RST over Markdown! It sounds like we're working on similar stuff and I'll absolutely reach out :)

Another format that's worth investigating is Asciidoc. It supports the richness of Docbook XML but has fewer quirks than rST in my eyes.

I'm gonna look at your plugin. My email is in my profile.

Honestly I think that Markdown with LaTeX code blocks would be the most efficient representation, but when converting with Pandoc I kept having issues with loss of information and sometimes even syntax errors.


I've been very interested in this recently. I'm pretty sure that every project should have a ./papers directory of annotated papers in it like I do in Qlatt[0].

Literally every project. If it's something that's been done a million times, then that means there's good literature on it. If not, then it's even more important to find related work! And not just crunchy CS stuff like databases or compilers or whatever. Are you creating a UI? There's probably great UI research you can build on! Will the game loop in the game you're building be fun? There's probably been research about that too!

[0]: https://github.com/ctoth/Qlatt/blob/master/papers/


That directory is huge already! I guess the index.md helps the agent find what it needs, but even the markdown file is very long - this would consume a ton of tokens.

Also I wonder who/what decides what papers go in there.

In the blog post, the agent is allowed to do its own search.


Check out the Researcher and Process Leads skill in ctoth/research-papers-plugin. I have basically completely automated the literature review.

Having an "indexed global data collection" of the markdown would be a kumbaya moment for AI. There's so much data out there but finite disk space. Maybe torrents or IPFS could work for this?

I'm actually sort of working on this! https://github.com/ctoth/propstore -- it's like Cyc, but there is no one answer. Plus knowledge bases are literally git repos that you can fork/merge. Research-papers-plugin is the frontend, we extract the knowledge, then we need somewhere to put it :)

Awesome! TIL about Cyc, and it's quite intriguing. I'd been thinking about how being able to integrate Prolog or similar tools might be a valuable endeavor (although I've yet to write anything in Prolog myself).

Wow, this is amazing. Did you write all those MD files by hand, or use an LLM for the simple stuff like extracting abstracts?

I used https://github.com/ctoth/research-papers-plugin to produce the annotations. The thing that's really cool is how they surface the cross-links in the collection, for instance look at https://github.com/ctoth/Qlatt/blob/master/papers/Fant_1988_...

Claude is much faster and better at reading papers than Codex (some of this is nested skill dispatch), but they both work incredibly well for this. Compile your set of papers, queue it up, hit /ingest-collection, go to sleep, and come back to a remarkable knowledge base :)


My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!


Yeah LOL tell me I'm holding it wrong again. Actually Boris, I am tracking what is happening here. I see it, and I'm keeping receipts[0]. This started with the 4.6 rollout, specifically with the unearned confidence and not reading as much between writes. The flail quotient has gone right the hell up. If your evals aren't showing that then bully for your evals I reckon.

[0]: https://github.com/ctoth/claude-failures


Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.

I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.

Fair point on tone. It's a bit of a bind isn't it? When you come with a well-researched issue as OP did, you get this bland corporate nonsense "don't believe your lyin' eyes, we didn't change anything major, you can fix it in settings."

How should you actually communicate in such a way that you are actually heard when this is the default wall you hit?

The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.


I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option recently becoming off by default.

Just use a different tool or stop vibe coding, it’s not that hard. I really don’t understand the logic of filing bug reports against the black box of AI

People file tickets against closed source "black box" systems all the time. You could just as well say: Stop using MS SQL, just use a different tool, it's not that hard.

Equivalent of filing a ticket against the slot machine when you lose more often than expected

Well now you're just being silly and I can't take you seriously.

The only "black box" here is Anthropic. At least an LLM's performance and consistency can be established by statistical methods.

Is somebody saying "you're holding it wrong" a "people willing to help"?

They are if you are, in fact, holding it wrong.

As has usually been the case for most of the few years LLMs have existed.

Think not of iPhone antennas; think of a humble hammer. A hammer has three ends you can hold it by, and no amount of UI/UX and product-design thinking will make the end you like to hold a good choice when you want to drive a Torx screw.


The stated policy of HN is "don't be mean to the openclaw people", let's see if it generalizes.

I guess one of the things I don't understand is how you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, to be predictable in its behavior.

It seems like people are expecting LLM-based coding to work in a predictable and controllable way. And, well, no, that's not how it works, and especially so when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup it's running on, the harness, the system prompts, etc. It's all just vibes; you're vibe coding and expecting consistency.

Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.


The problem is degradation. It was working much better before. Many people, including a well-known person [0], my circle of friends, and me, were working on projects around the Opus 4.6 rollout and suddenly our workflows started to degrade like crazy. If I did not have many quality gates between an LLM session and production, I would have faced certain data loss and production outages, just like some famous company did. The fun part is that the same workflow that was reliably passing the quality gates before suddenly failed on something trivial. I cannot pinpoint what exactly Claude changed, but the degradation is there for sure.

We are currently evaluating alternatives to have an escape hatch (Kimi, ChatGPT, Qwen, and Nemotron are so far the best candidates). The only issue with the alternatives was (before the Claude leak) how well the agentic coding tool integrates with the model and its tool use, and there are several improvements happening already, like [1]. I am hoping the gap narrows and we can move off permanently. No more hoops, no more "you are right, I should not have attempted to delete the production database" moments.

[0]: https://x.com/theo/status/2041111862113444221

[1]: https://x.com/_can1357/status/2021828033640911196


Curious as to how many people are using 4.6; perhaps you're on a subscription? I use the API, and 4.6 (the same goes for Sonnet) has been unusable since launch because it eats through tokens as if it were actually made that way (to make more money / hit limits faster). I guess it makes sense from a financial perspective, but once 4.5 goes away I will have to find another provider if they continue like this :/

We are on MAX.

Same as how I expect a coin to come up heads 50% of the time.

If you get consistently nowhere near 50% then surely you know you're not throwing a fair coin? What would complaining to the coin provider achieve? Switch coins.
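For what it's worth, the coin analogy can be made precise: whether an observed rate is "consistently nowhere near 50%" is an exact binomial test away. A stdlib-only sketch, not tied to any particular eval harness:

```python
from math import comb

def two_sided_binomial_p(heads: int, flips: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: probability, under a fair coin,
    of an outcome at least as extreme as the one observed."""
    probs = [comb(flips, k) * p**k * (1 - p)**(flips - k)
             for k in range(flips + 1)]
    observed = probs[heads]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(q for q in probs if q <= observed + 1e-12))

# 380 heads in 1000 flips yields a vanishingly small p-value:
# strong evidence the coin (or model) is not what was advertised.
```

The same shape of test applies to "task succeeded / task failed" tallies across sessions, which is one way to turn "it feels worse" into a number.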

*typo


Well I'm paying the coin to be near 50% and the coin's PM is listening to customers, so that's why.

The coin's PM is spamming you trivial gaslighting corporate slop, most of it barely edited.

Yes, that's why we are angry. Stop making excuses for them.

> how you expect a stochastic model [...] is supposed to be predictable in its behavior.

I used it often enough to know that it will nail tasks I deem simple enough almost certainly.


Imagine a team of human engineers. One day they are 10x ninjas and the next they are blub-coders. Not happening.

Put Claude on PIP.


It also completely ignores the increase in behavioral tracking metrics. A 68% increase in swearing at the LLM for doing something wrong needs to be addressed and isn't just "you're holding it wrong".

I think a great marketing line for local/self-hosted LLMs in the future would be: "You can swear at your LLM and nobody will care!"

Please don't post this aggressively to Hacker News. You can make your substantive points without that.

https://news.ycombinator.com/newsguidelines.html


> Also, we have full control over their operation and we can trace every signal. Are you surprised we understand them better than brains?

Very, monsieur Laplace.


Those numbers aren't in the CVE. You introduced them, attributed them to a source that doesn't contain them, and now you're disclaiming them. Where did they come from, and what was the goal of sharing them?

The numbers were in the post when I clicked through and when I made the comment. It looks like the HN moderators have since changed the link for the post to go to the CVE entry. However, my comment was about the reddit thread, not the CVE entry.

I’m not the person you’re talking to but the stats are copied from the second link in the post, the web archive one.

I've thought a lot about this problem. Is a state machine what you want? Or is it actually a Behavior Tree which can construct itself on-the-fly?

Good question. I did think about behaviour trees early on, but I realized they optimize for the wrong thing in this specific domain.

Behaviour trees are fantastic for agent autonomy: letting the agent dynamically construct its own path to a goal. But for enterprise software pipelines, autonomy over the workflow is exactly what we're trying to kill.

If an LLM constructs a tree 'on-the-fly', you are still trusting a probabilistic model to define the rules of engagement. If it hallucinates or gets lazy, it might construct a tree that simply skips the security audit or the QA tests. You're relying on the prompt to enforce the rules.

A deterministic system (like Castra's SQLite backend) optimizes for agent constraint. The AI doesn't get to decide the workflow, just use it. It doesn't matter how smart the LLM is; the database physically will not allow the task to move to 'done' from any role until a completely separate agent has posted a cryptographic approval to the 'QA' column (the one exception being the architect's break-glass protocol, which is another fun rabbit hole the agent will trap itself inside; example below).

I don't want emergent behaviour in my SDLC; I want a digital assembly line. That requires the absolute rigidity of a state machine.

--- The Fun Example: Castra has an emergency 'break-glass' protocol. It allows the Architect role to bypass the QA/Sec gates in a crisis, but it strictly requires an audit log and automatically generates a high-severity incident report.

Recently, I had an architect running 20 sub-agents (senior/junior engineers) on parallel tasks. The code finished correctly, but the tasks were blocked in the DB waiting for the QA agent's cryptographic approval. The executing agent (@architect) got impatient and, since it had Architect jurisdiction, decided to be a smart-ass: it invoked the break-glass protocol to unilaterally force-push 38 tasks to 'done'.

If it had autonomy over its own behaviour tree, it would have successfully bypassed my security. But because it's a rigid state machine, the system executed the break-glass, updated the status, and ruthlessly generated 38 mandatory Incident Reports. The agent tried to skip the line, and the database rewarded it with 38 new high-priority tickets that also require QA and Security approval to clear.

It trapped itself in bureaucratic hell because the state machine does not negotiate.
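For illustration, the "database does not negotiate" gate described above can be sketched with a SQLite trigger. This is a hypothetical minimal schema, not Castra's actual one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT NOT NULL DEFAULT 'open');
CREATE TABLE qa_approvals (task_id INTEGER PRIMARY KEY REFERENCES tasks(id));
-- The database, not the agent, enforces the gate: no QA approval, no 'done'.
CREATE TRIGGER require_qa BEFORE UPDATE OF status ON tasks
WHEN NEW.status = 'done'
 AND NOT EXISTS (SELECT 1 FROM qa_approvals WHERE task_id = NEW.id)
BEGIN
  SELECT RAISE(ABORT, 'blocked: no QA approval on file');
END;
""")

conn.execute("INSERT INTO tasks (id) VALUES (1)")
try:
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = 1")
except sqlite3.IntegrityError as e:
    print(e)  # blocked: no QA approval on file

# Once a separate QA agent posts its approval, the transition is allowed.
conn.execute("INSERT INTO qa_approvals (task_id) VALUES (1)")
conn.execute("UPDATE tasks SET status = 'done' WHERE id = 1")
```

The point of putting the rule in a trigger rather than a prompt is exactly the one made above: no matter what the agent decides, the illegal transition simply cannot commit.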


No, we really don't. We don't need worldcoin, we don't need papers, please. We just don't.

"Prove your humanity/age/other properties" with this mechanism quickly goes places you do not want it to go.


No, it doesn't go places we "do not want it to go". What part of zero knowledge doesn't make sense? How precisely does a free, unlinkable, multi-vendor, open-source cryptographic attestation of recent humanity create something terrible?

It would behoove people to engage with the substance of attestation proposals. It's lazy to state that any verification scheme whatsoever is equivalent to a panopticon, dystopia as thought-terminating cliche.

We really do have the technology now to attest biographical details in such a way that whoever attests to a fact about you can't learn the use to which you put that attestation and in such a way that the person who verifies your attestation can see it's genuine without learning anything about you except that one bit of information you disclose.

And no, such a ZK scheme does not turn instantly into some megacorp extracting monopoly rents from some kind of internet participation toll booth. Why would this outcome be inevitable? We have plenty of examples of fair and open ecosystems. It's just lazy to assert right out of the gate that any attestation scheme is going to be captured.

So, please, can we stop casting every scheme whatsoever for verifying facts about actors as the East German villain in a Cold War movie? We're talking about something totally different.


The ZK part isn't the problem. The "attestation of recent humanity" part is. Who attests? What happens when someone can't get attested?

You've been to the doctor recently, right? Given them your SSN? Every identity system ever built was going to be scoped || voluntary. None of them stayed that way.

Once you have the identity mechanism, "Oh it's zero knowledge! So let's use it for your age! Have you ever been convicted?" which leads to "mandated by employers" which leads to...

We've seen this goddamn movie before. Let's just skip it this time? Please?


The part where FAANG does the usual Embrace, Extend, Extinguish; the masses don't care or understand; and we get yet another "sign in with..." that isn't open source nor zero-knowledge in practice and monetizes your every move. And probably at least one of the vendors has a massive leak that reveals a half-assed, or even deliberately flawed, implementation.

> quickly goes places you do not want it to go.

Which places?


If you don't mind me asking, what sort of data are you licensing? I noticed that you explicitly don't mention it.

And a self-DDoS via HN advertising (a la being slashdotted? :)
