olliestanley's comments

olliestanley · 2025-06-02T19:59:26 1748894366

Difficult one. GSM8K and MATH evals (both reported in the Reasoning Gym paper) are common in smaller-model RL papers for a reason: smaller models can get decent scores on them, unlike on fresher & harder benchmarks.

Part of the aim of RG is to be used as a difficulty-adjustable & non-repeating eval, though, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!

olliestanley · 2025-06-02T10:58:26 1748861906

We definitely plan to maintain the project for as long as there is interest in it. If you have ideas for new tasks, we'd always welcome contributions!

phh · 2025-06-02T12:30:37 1748867437

Thanks for the answer! As a toy project I implemented wikiracing with trl. I'll probably try to PR that to your gym. (Can't say that I managed to improve the score with it, though.)