ARC-AGI v3 is a pretty good benchmark, and it's notably different from the other ARC-AGI versions in two ways: it has a "truer" human baseline (you can go play it right now and add your datapoint), and it captures in-context learning better, since you start an unfamiliar game and then master it over time.
Also, bottom 10% feels like a bad comparison; median human would be better. And unlike "specialized" things like programming, game playing is something almost all of us have done.