2 hours in and this thread is already stacked, but I'll bite since I'm stuck on this problem and need help. I'm working on a language learning solution that involves LLMs. The way I'm branding it is "Anki meets AI", because it combines a flashcard-esque method of generating complete exercises (multiple choice, cloze, etc.) with the tried-and-true SRS methodology.
I think it works great! The problem is, I only *think* it works great. The issue is that it's doubly lossy: LLMs aren't perfect, and translating from one language to another isn't perfect either. So the struggle is in trusting the LLM (because it's the only tool good enough for the job other than humans) while looking for solid ground, so that users feel like they're moving forward rather than going astray.
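To make the shape concrete, here's a rough sketch of how the two halves could fit together (not my actual code: `generate_cloze` is a placeholder for the LLM call, and the scheduler is a heavily simplified SM-2-style update):

```python
from dataclasses import dataclass

@dataclass
class Card:
    prompt: str              # e.g. a cloze sentence with one word blanked out
    answer: str              # the hidden word or phrase
    interval_days: float = 1.0
    ease: float = 2.5        # SM-2-style ease factor

def generate_cloze(sentence: str) -> Card:
    """Hypothetical LLM call that turns a target-language sentence into a
    cloze card. The real prompt/model is whatever the app actually uses."""
    raise NotImplementedError

def review(card: Card, grade: int) -> Card:
    """Heavily simplified SM-2-flavoured update; grade is 0-5."""
    if grade < 3:
        card.interval_days = 1.0                          # failed: reset
    else:
        card.ease = max(1.3, card.ease + 0.1 - (5 - grade) * 0.08)
        card.interval_days *= card.ease                   # passed: grow the interval
    return card
```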
- LLMs are better at critiquing translations than producing them (even with thinking enabled, which doesn't actually help!)
- When they do make mistakes, different models tend to make different mistakes.
So it translates with the top 4-5 models (based on my research), then has another model critique, compare, and combine the results.
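Roughly, the pipeline has this shape (not the real code; `call_model` and the model names are placeholders for whatever API client and models you'd actually use):

```python
CANDIDATE_MODELS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names
JUDGE_MODEL = "judge-model"                                      # placeholder name

def call_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM API is in use."""
    raise NotImplementedError

def hybrid_translate(text: str, target_lang: str) -> str:
    # Step 1: each candidate model translates independently.
    candidates = [
        call_model(m, f"Translate into {target_lang}:\n\n{text}")
        for m in CANDIDATE_MODELS
    ]
    # Step 2: a separate model critiques and compares the candidates,
    # then outputs a single combined translation.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Source text:\n{text}\n\n"
        f"Candidate translations into {target_lang}:\n{numbered}\n\n"
        "Critique and compare the candidates, then output the single best "
        "combined translation."
    )
    return call_model(JUDGE_MODEL, prompt)
```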
It's more expensive than any one model, but it isn't super expensive. The main issue is that it's quite slow. Anyway, hopefully it's useful, and hopefully the data is useful too. Feel free to email/reply if you have any questions/ideas for tests etc.
Sorry, when you said "hybrid" I was expecting something that was partly an LLM and partly something else. How did you arrive at your coherence/idiomaticity/accuracy numbers (if you'll forgive me for not delving too deep into the website)?
"Hybrid" as in a combination of different LLMs. I recommend trying the demo on the site; it should give you an idea of what it's doing. The code is also pretty short.
So those numbers are from an older version of the benchmark.
Coherence is measured by:
- Translating from English, to the target language, and back to English
- Repeating that three times
- Having 3 LLMs score how close the original English is to the round-tripped English
I like it because it's robust against LLM bias, but it obviously isn't exact, and I found that after a certain point it's actually negatively correlated with quality, because it incentivises literal, word-by-word translations.
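In sketch form (not the real harness: `translate` and `judge_closeness` are placeholders, and I'm assuming the three repeats are independent round trips rather than chained ones):

```python
from statistics import mean

JUDGES = ["judge-1", "judge-2", "judge-3"]   # placeholder names

def translate(model: str, text: str, target_lang: str) -> str:
    """Hypothetical translation call."""
    raise NotImplementedError

def judge_closeness(judge: str, original: str, round_tripped: str) -> float:
    """Hypothetical judge call: score (e.g. 0-100) how close the
    round-tripped English is to the original English."""
    raise NotImplementedError

def coherence_score(system: str, english: str, target_lang: str,
                    rounds: int = 3) -> float:
    scores = []
    for _ in range(rounds):                    # three independent round trips
        forward = translate(system, english, target_lang)
        back = translate(system, forward, "English")
        scores += [judge_closeness(j, english, back) for j in JUDGES]
    return mean(scores)
```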
Accuracy and Idiomaticity are based on asking the judge LLMs to rate how accurate / idiomatic the translations are. I mostly focused on idiomaticity, as it was the differentiator at the upper end.
The new benchmark has gone through a few iterations, and I'm still not super happy with it. It's now based purely on LLM scoring (this time 0-100), but with better stats, prompting, etc. I've also run some small-scale tests on coherence, including a few today that I haven't published yet, and again they have DeepL and Lingvanex doing well, because they tend towards quite rigid translations over idiomatic ones. Interestingly, Claude 4 is also doing quite well on those metrics.
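For concreteness, the aggregation step could look something like this (illustrative only, with made-up numbers; the actual benchmark may aggregate the 0-100 scores differently):

```python
from math import sqrt
from statistics import mean, stdev

def aggregate(scores: list[float]) -> dict:
    """Collapse a set of 0-100 judge scores into a mean plus a rough 95%
    interval via the standard error. One reasonable choice; not necessarily
    what the benchmark actually does."""
    m = mean(scores)
    se = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return {"mean": m, "ci95": (m - 1.96 * se, m + 1.96 * se)}

# Made-up numbers, just to show the shape of the output.
print(aggregate([78, 83, 71, 88, 80]))
```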
I need to sleep, but I can discuss it more tomorrow if you'd like.