
How are you evaluating whether LLM answers are right or wrong? I saw some answers graded wrong that were actually right, and some graded right that were actually wrong. Are you just looking for keywords, etc.? Or is this all run beforehand and graded by humans?
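
For illustration, keyword grading tends to fail in exactly the two directions you describe. A minimal sketch (function and example strings are hypothetical, not from the site):

    # Hypothetical keyword-based grader: mark an answer correct
    # if every expected keyword appears somewhere in it.
    def keyword_grade(answer: str, keywords: list[str]) -> bool:
        text = answer.lower()
        return all(k.lower() in text for k in keywords)

    # False positive: mentions the keyword while denying it.
    keyword_grade("It is definitely not Paris.", ["paris"])  # True, but wrong
    # False negative: correct answer phrased without the keyword.
    keyword_grade("The French capital.", ["paris"])          # False, but right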


Yeah, judging from my experience this is completely broken. GPT would often get the answer wrong and the site would claim it was correct.


Does anyone know a good method for judging right/wrong answers from LLMs? Keyword matching seems brittle. Perhaps another LLM as a judge?
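
An LLM-as-judge setup is the usual alternative: hand a second model the question, a reference answer, and the candidate answer, and ask for a YES/NO verdict. A minimal sketch, assuming the openai>=1.0 Python client; the model name and prompt wording here are illustrative, not anything the site confirmed:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def llm_judge(question: str, reference: str, answer: str) -> bool:
        # Ask a second model whether `answer` conveys the same
        # meaning as `reference`, forcing a one-word verdict.
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Does the candidate answer convey the same meaning as the "
            "reference? Reply with exactly YES or NO."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduce grading randomness
        )
        verdict = resp.choices[0].message.content.strip().upper()
        return verdict.startswith("YES")

The judge can itself be wrong, so it's worth spot-checking its verdicts against a small set of human-graded answers before trusting it at scale.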




