No discussion on problem difficulty, or on result quality besides "the Edgee run...

sachamorard · 2026-04-09T17:32:47

More info in the GitHub repo, in the reports folder (sorry, I'm not sure I can add the link here without being flagged).

"Codex + Edgee consumes roughly half the fresh tokens of the normal Codex baseline. Output tokens are marginally higher (+3,312, +19.5%), suggesting the Edgee scenario produces slightly more verbose responses but dramatically reduces context ingestion."

kokakiwi · 2026-04-09T17:38:22

I think the problem given to Codex for the benchmark is the one in the attached video, where two Codex instances run side by side, working on a "standard" dev thingy.