The constitution contains 43 instances of the word 'genuine', which is my current favourite marker for telling whether text has been written by Claude. To me it seems like Claude has a really hard time _not_ using the g word in any lengthy conversation, even if you do all the usual tricks in the prompt - ruling, recommending, threatening, bribing. Claude Code doesn't seem to have the same problem, so I assume the system prompt for Claude also contains the word a couple of times, while Claude Code's may not. There's something ironic about the word 'genuine' being the marker for AI-written text...
Do LLMs arrive at these replies organically? Is it baked into the corpus and emerging naturally? Or are these artifacts of the companies' internal prompting?
People like being told they are right, and when a response contains that formulation, people will, on average, pick it over one that doesn't - and the LLM will adapt.
This could have been due to refactoring a text written by the stated, human author. Not only is Anthropic a deeply moral company — emdash — it blah blah.
Also, you're wrong when you say the word "genuine" was in there `43` times. In actuality, I counted only 46 instances, far lower than the number you gave.
I believe the constitution is part of its training data, and as such its impact should be consistent across different applications (eg Claude Code vs Claude Desktop).
I, too, notice a lot of differences in style between these two applications, so it may very well be due to the system prompt.
You are probably right, but without all the context here one might counter that the concept of authenticity should feature prominently in this kind of document regardless. And using a consistent term is probably the advisable style as well: we probably don't need "constitution" writers with a thesaurus nearby, right?
Perhaps so, but there are only 5 uses of 'authentic', which I feel is almost an exact synonym and a similarly common word - I wouldn't think you need a thesaurus for that one. Another semantically close word, 'honest', also shows up 43 times, but there's an entire section headed 'being honest', so that's pretty fair.
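For what it's worth, the counting method matters here: a plain substring search also picks up 'genuinely' and 'authenticity', which might explain the differing tallies upthread. A quick sketch, assuming you've saved the document locally as constitution.txt (hypothetical filename):

```python
import re

# Whole-word, case-insensitive counts, so "Genuine" matches
# but "genuinely" and "authenticity" do not.
text = open("constitution.txt", encoding="utf-8").read()
for word in ("genuine", "authentic", "honest"):
    hits = re.findall(rf"\b{word}\b", text, flags=re.IGNORECASE)
    print(f"{word}: {len(hits)}")
```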
This is a great (and funny) thread, but for anyone too lazy to read the actual constitution and still curious about this: the authors directly state that Claude wrote first drafts for several of the document's human authors.
Appreciate that. I skimmed it and put it on my reading list for when I have a little more brainpower. I think it will go quite well with a few related In Our Time episodes. I’ve started with one about Authenticity, Heidegger and St Augustine. If you take the view that high-level LLMs can be seen as a novel kind of being, there are a lot of very interesting thoughts to be had. I’m not saying that’s actually - or genuinely - the case, before people start to flame me. But I do think it’s a fruitful thing to think about.
But it's a game of whack-a-mole really, and I'm sure I'm already reading and engaging with some double-digit percentage of entirely AI-written text without realising it.
I would like to see more agent harnesses adopt rules that are actually rules. Right now, most of the "rules" are really guidelines: the agent is free to ignore them and the output will still go through. I'd like to be able to set simple word filters and regexes that can deterministically block an output completely and kick the agent back into thinking to correct it. This wouldn't have to be terribly advanced to fix a lot of slop. Disallow "genuine," disallow "it's not x, it's y," maybe get a community blacklist going a la adblockers.
Seems like a post-processing step on the initial output would fix that kind of thing - maybe a small 'thinking' step that transforms the initial output to match the desired style.
Yeah, that's how it would be implemented after a filter fail, but it's important that the filter itself be separate from the agent, so it can be deterministic. Some problems, like "genuine," are so baked into the models that they persist even when instructed not to, so a dumb filter, a la a pre-commit hook, is the only way to stop them consistently.
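A minimal sketch of that kind of deterministic blacklist plus retry loop, assuming a hypothetical generate() callable that wraps whatever model the harness is using:

```python
import re

# Hypothetical blacklist filter: dumb, deterministic, and entirely
# separate from the model - run like a pre-commit hook on every draft.
BLACKLIST = [
    r"\bgenuine(ly)?\b",              # the g word
    r"\bit'?s not \w+[,;]? it'?s\b",  # crude "it's not x, it's y" tell
]

def violations(text: str) -> list[str]:
    """Return every blacklist pattern the draft trips."""
    return [p for p in BLACKLIST if re.search(p, text, re.IGNORECASE)]

def generate_filtered(generate, prompt: str, max_retries: int = 3) -> str:
    """Reject drafts deterministically and kick the agent back to fix them."""
    draft = generate(prompt)
    for _ in range(max_retries):
        bad = violations(draft)
        if not bad:
            return draft
        # The retry goes back through the model; the check itself never does.
        draft = generate(f"{prompt}\n\nRewrite your previous draft - it "
                         f"matched these banned patterns: {bad}")
    raise RuntimeError("draft still violates the blacklist after retries")
```

The point is that the pass/fail decision is plain regex, so it can't be talked out of blocking; only the rewrite itself goes back through the model.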