
The problem is that the training data doesn't contain a lot of "I don't know".


The bigger problem is that the benchmarks and multiple-choice tests the models are trained to optimize for don't distinguish between a wrong answer and "I don't know", which is both stupid and surprising. There was a thread here on HN about this recently.
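To make the incentive concrete, here is a rough sketch of the scoring argument, not any real benchmark's grader. The function names, the `wrong_penalty` parameter, and the convention that abstaining scores 0 are all assumptions for illustration. With no penalty for wrong answers, guessing is never worse than saying "I don't know"; with a penalty, abstaining wins whenever the model is unsure enough.

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """Expected score for answering: +1 if right, -wrong_penalty if wrong.

    p_correct is the (hypothetical) probability the answer is right.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct, wrong_penalty=0.0, abstain_score=0.0):
    """Pick whichever of answering or saying "I don't know" scores higher.

    abstain_score is the assumed score for "I don't know" (0 by default).
    """
    if expected_score(p_correct, wrong_penalty) > abstain_score:
        return "answer"
    return "abstain"


# Plain right/wrong grading: a 25% guess still beats abstaining.
print(best_action(0.25))

# Same guess with a full point deducted for a wrong answer.
print(best_action(0.25, wrong_penalty=1.0))
```

Under the plain grading the first call picks "answer" for any nonzero chance of being right, which is the behaviour being complained about; the second call only picks "answer" once the model is more likely right than wrong.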


That's not important compared to the post-training RL, which isn't "training data".



