More info in the GitHub repo, in the reports folder (sorry, I'm not sure I can add the link here without being flagged).
"Codex + Edgee consumes roughly half the fresh tokens of the normal Codex baseline. Output tokens are marginally higher (+3,312, +19.5%), suggesting the Edgee scenario produces slightly more verbose responses but dramatically reduces context ingestion."
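For what it's worth, the quoted delta and percentage are enough to back out the approximate output-token totals. Quick sketch below; the function name is mine, and it assumes the +19.5% is measured relative to the normal Codex baseline and is rounded to one decimal, so treat the results as estimates, not numbers from the report.

```python
def back_out_totals(delta_tokens: int, pct_increase: float) -> tuple[float, float]:
    """Recover (baseline, new_total) from an absolute delta and its percentage.

    Assumes pct_increase is relative to the baseline (hypothetical reading
    of the quoted report, not confirmed by it).
    """
    baseline = delta_tokens / (pct_increase / 100.0)
    return baseline, baseline + delta_tokens


# Figures taken from the quote: +3,312 output tokens, +19.5%.
baseline, with_edgee = back_out_totals(3312, 19.5)
print(f"baseline output tokens   ~ {baseline:,.0f}")
print(f"Codex + Edgee output     ~ {with_edgee:,.0f}")
```

That works out to roughly 17k output tokens for the baseline and about 20.3k with Edgee, give or take the rounding in the percentage.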
I think the problem given to Codex for the benchmark is the one in the attached video, where two Codex instances run side by side, working on a "standard" dev task.