
How are you evaluating whether LLM answers are right or wrong? I saw some answers graded wrong that were actually right, and some graded right that were actually wrong. Are you just looking for keywords, etc.? Or is this all run beforehand and graded by humans?
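
For illustration, keyword grading tends to fail in exactly the two directions you describe. A minimal sketch (function and example strings are hypothetical, not from the site):

    # Hypothetical keyword-based grader: mark an answer correct
    # if every expected keyword appears somewhere in it.
    def keyword_grade(answer: str, keywords: list[str]) -> bool:
        text = answer.lower()
        return all(k.lower() in text for k in keywords)

    # False positive: mentions the keyword while denying it.
    keyword_grade("It is definitely not Paris.", ["paris"])  # True, but wrong
    # False negative: correct answer phrased without the keyword.
    keyword_grade("The French capital.", ["paris"])          # False, but right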


Yeah, judging from my experience this is completely broken. GPT would often get the answer wrong and the site would claim it was correct.


Does anyone know a good method for judging right/wrong answers from LLMs? Keyword matching seems brittle. Perhaps another LLM as a judge?
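
An LLM-as-judge setup is the usual alternative: hand a second model the question, a reference answer, and the candidate answer, and ask for a YES/NO verdict. A minimal sketch, assuming the openai>=1.0 Python client; the model name and prompt wording here are illustrative, not anything the site confirmed:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def llm_judge(question: str, reference: str, answer: str) -> bool:
        # Ask a second model whether `answer` conveys the same
        # meaning as `reference`, forcing a one-word verdict.
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Does the candidate answer convey the same meaning as the "
            "reference? Reply with exactly YES or NO."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduce grading randomness
        )
        verdict = resp.choices[0].message.content.strip().upper()
        return verdict.startswith("YES")

The judge can itself be wrong, so it's worth spot-checking its verdicts against a small set of human-graded answers before trusting it at scale.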




