olliestanley's comments

Difficult one. GSM8K and MATH evals (both reported in the Reasoning Gym paper) are common in smaller-model RL papers for a reason: smaller models can get decent scores on them, unlike on fresher and harder benchmarks.

Part of the aim of RG is to serve as a difficulty-adjustable, non-repeating eval, though, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!
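
For anyone unfamiliar with the idea, here is a rough sketch of what "difficulty-adjustable and non-repeating" can mean for a procedurally generated eval. This is not Reasoning Gym's actual API: `make_addition_task`, `score`, and the `difficulty` parameter are hypothetical names made up for illustration, and the addition task is only a stand-in.

```python
import random


def make_addition_task(seed: int, index: int, difficulty: int) -> dict:
    """Hypothetical generator: builds one addition problem per (seed, index).

    The same (seed, index, difficulty) always yields the same problem, so a
    run is reproducible, while a new seed yields a fresh problem set.
    """
    # Derive a per-item RNG so items can be generated independently.
    rng = random.Random(seed * 1_000_003 + index)
    upper = 10 ** difficulty  # higher difficulty -> larger operands
    terms = [rng.randrange(upper) for _ in range(difficulty + 1)]  # and more of them
    question = " + ".join(str(t) for t in terms) + " = ?"
    return {"question": question, "answer": str(sum(terms))}


def score(model_answer: str, task: dict) -> float:
    """Exact-match scoring: 1.0 if the answer matches, otherwise 0.0."""
    return 1.0 if model_answer.strip() == task["answer"] else 0.0


if __name__ == "__main__":
    # Easy and hard versions of the same eval, each freshly generated from a seed.
    for difficulty in (1, 3):
        task = make_addition_task(seed=42, index=0, difficulty=difficulty)
        print(difficulty, task["question"])
```

Because problems come from a seed rather than a fixed file, an eval run can use a seed that no training run has seen, and the difficulty knob can be raised as models improve.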


We definitely plan to maintain the project for as long as there is interest in it. If you have ideas for new tasks, we'd always welcome contributions!


Thanks for the answer! As a toy project I implemented wikiracing with trl. I'll probably try to PR that to your gym. (I can't say that I managed to improve the score with it, though.)

