
> based on both frontier model performance in high-level math and CS competitions

IMO the only real takeaway from those successes is that RL for reasoning works when you have a clear, verifiable reward signal. Whether this RL-based approach to reasoning can be made to work in more general settings remains to be seen.

There is also a big disconnect between how well these models do on benchmark tasks they've been specifically trained for and how easily they still fail at everyday tasks. Yesterday I had the just-released Sonnet 4.5 botch a unit conversion from radians to arcseconds as part of a simple problem - it was off by a factor of 3. Not exactly PhD-level math performance!
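For reference, the conversion itself is a one-liner (a minimal Python sketch; this is just the standard formula, not the exact prompt from my session):

    import math

    # 1 rad = (180 / pi) degrees, and 1 degree = 3600 arcsec,
    # so 1 rad = 3600 * 180 / pi arcsec (~206264.806)
    ARCSEC_PER_RAD = 3600.0 * 180.0 / math.pi

    def rad_to_arcsec(rad: float) -> float:
        return rad * ARCSEC_PER_RAD

    print(rad_to_arcsec(1.0))  # 206264.80624709636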



I mean, I agree. There is not yet a clear path or story for how a model can deliver consistently expert-level performance on real-world tasks, and the various breakthroughs we hear about don't address that. I think the industry consensus is simply that we haven't correctly measured or targeted those abilities yet, and there is now a big push to do so. We'll see if that works out.



