More

zone411 · 2026-04-09T18:25:49 1775759149

https://variety.com/2020/digital/news/twitter-unblocks-new-y...

hn_acc1 · 2026-04-09T19:01:49 1775761309

How are those "conservative opinions"? Are you saying the whole thing was right-wing fan-fiction?

zone411 · 2026-03-28T16:51:13 1774716673

I built this benchmark this month: https://github.com/lechmazur/sycophancy. There are large differences between LLMs. There are large differences between LLMs. For example, Mistral Large 3 and GPT-4.1 will initially agree with the narrator, while Gemini will disagree. I swap sides, so this is not about possible viewpoint bias in the LLMs. But another benchmark shows that Gemini will then change its view very easily in a multi-turn conversation while Kimi K2.5 or Grok won't: https://github.com/lechmazur/persuasion.

zone411 · 2026-03-28T16:45:52 1774716352

I built two related benchmarks this month: https://github.com/lechmazur/sycophancy and https://github.com/lechmazur/persuasion. There are large differences between LLMs. For example, good luck getting Grok to change its view, while Gemini 3.1 Pro will usually disagree with the narrator at first but then change its position very easily when pushed.

zone411 · 2026-03-23T21:10:35 1774300235

Hmm, maybe in the next edition, Opus gets expensive. I should probably run GPT-5.4 xhigh too if I do that for fairness...

zone411 · 2026-03-16T19:06:18 1773687978

Rationalists were right about everything that mattered: crypto, AI, COVID... HN commentators, by contrast, were wrong about everything that mattered.

zone411 · 2026-03-05T21:06:22 1772744782

Results from my Extended NYT Connections benchmark:

GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).

GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).

GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).

stavros · 2026-03-06T00:45:02 1772757902

How do you score this? Losing/winning the game with 4 lives?

kinderjaje · 2026-03-06T07:26:03 1772781963

I added that info on https://automatio.ai/models/gpt-5-4

oliwary · 2026-03-06T06:07:12 1772777232

Impressive! Do you include puzzles released before the training data cutoff date?

zone411 · 2026-02-25T20:30:19 1772051419

I've made top-10 lists of LLMs' favorite names to use in creative writing here: https://x.com/LechMazur/status/2020206185190945178. They often recur across different LLMs. For example, they love Elara and Elias.

zone411 · 2026-02-17T21:34:18 1771364058

They're improved compared to 4.5 on my Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/).

Sonnet 4.6 Thinking 16K scores 57.6 on the Extended NYT Connections Benchmark. Sonnet 4.5 Thinking 16K scored 49.3.

Sonnet 4.6 No Reasoning scores 55.2. Sonnet 4.5 No Reasoning scored 47.4.

rmi_ · 2026-02-18T08:25:57 1771403157

Thanks! I really like your benchmark.

Why is GLM-5 x's, though?

zone411 · 2026-01-21T01:38:13 1768959493

For people interested in these kinds of benchmarks, I have two multiplayer, multi-round games:

- Elimination Game Benchmark: Social Reasoning, Strategy, and Deception in Multi-Agent LLM Dynamics at https://github.com/lechmazur/elimination_game/

- Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure at https://github.com/lechmazur/step_game/

zone411 · 2025-12-17T21:29:54 1766006994

Scores 92.0 on my Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/). Gemini 2.5 Flash scored 25.2, and Gemini 3 Pro scored 96.8.