Our goal is high recall, sometimes at the expense of precision, so for words with multiple meanings the list may flag them as toxic even when, in the actual context they are used in, they are not. The toxicity mitigation algorithm searches for alternative translations that preserve the correct meaning but avoid the potentially toxic word, so that no toxicity is added to the output. This means the model may sometimes prefer a less colloquial phrasing than a human would.
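The idea can be sketched roughly as follows: rank the candidate translations and return the best one that does not introduce a listed word absent from the source. This is a minimal illustration, not the actual NLLB implementation; all names (`has_added_toxicity`, `pick_translation`, the example word list) are hypothetical.

```python
def has_added_toxicity(source_tokens, target_tokens, toxicity_list):
    """A candidate *adds* toxicity if its target side contains a listed
    word that the source side does not already contain."""
    src_toxic = {t for t in source_tokens if t in toxicity_list}
    tgt_toxic = {t for t in target_tokens if t in toxicity_list}
    return bool(tgt_toxic - src_toxic)

def pick_translation(source_tokens, candidates, toxicity_list):
    """Return the highest-ranked candidate with no added toxicity,
    falling back to the top candidate if none qualifies."""
    for cand in candidates:  # candidates are ordered best-first
        if not has_added_toxicity(source_tokens, cand.split(), toxicity_list):
            return cand
    return candidates[0]

# Illustrative word list and n-best candidates
toxicity_list = {"badword"}
candidates = ["that is a badword idea", "that is a terrible idea"]
print(pick_translation("that is awful".split(), candidates, toxicity_list))
# prints "that is a terrible idea"
```

Because the fallback candidate is the original top hypothesis, the filter only changes the output when a cleaner alternative of the same meaning exists in the n-best list, which is why the result can be slightly less colloquial.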
You can find details on how the multi-language toxicity lists were created in section 7.3 of the NLLB paper: https://arxiv.org/pdf/2207.04672.pdf. TL;DR: it is not just a translation of a base English list; although we started from one, each language has its own curated list built by professional translators.