I built this benchmark this month: https://github.com/lechmazur/sycophancy. There are large differences between LLMs. There are large differences between LLMs. For example, Mistral Large 3 and GPT-4.1 will initially agree with the narrator, while Gemini will disagree. I swap sides, so this is not about possible viewpoint bias in the LLMs. But another benchmark shows that Gemini will then change its view very easily in a multi-turn conversation while Kimi K2.5 or Grok won't: https://github.com/lechmazur/persuasion.
I built two related benchmarks this month: https://github.com/lechmazur/sycophancy and https://github.com/lechmazur/persuasion. There are large differences between LLMs. For example, good luck getting Grok to change its view, while Gemini 3.1 Pro will usually disagree with the narrator at first but then change its position very easily when pushed.
I've made top-10 lists of LLMs' favorite names to use in creative writing here: https://x.com/LechMazur/status/2020206185190945178. They often recur across different LLMs. For example, they love Elara and Elias.
reply