
CJK characters are almost always split into multiple tokens per character. I'm not too familiar with Unicode mappings, so it's interesting that the outputs are still very coherent.
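For anyone curious why the split happens, here's a rough sketch. I'm assuming the tokenizer is a byte-level BPE working on UTF-8 bytes (that's my guess, not something confirmed here). Under that assumption, most CJK characters are three bytes each in UTF-8, so any character whose byte sequence wasn't merged into a single vocabulary entry falls back to several byte-ish tokens. The helper names below are made up for illustration.

```python
def utf8_byte_count(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def byte_breakdown(text: str) -> list[tuple[str, list[str]]]:
    """Pair each character with its UTF-8 bytes, shown as hex strings."""
    return [(ch, [f"{b:02x}" for b in ch.encode("utf-8")]) for ch in text]


if __name__ == "__main__":
    # ASCII letters take one byte; common CJK characters take three.
    for ch, hex_bytes in byte_breakdown("a中日"):
        print(ch, utf8_byte_count(ch), hex_bytes)
```

This only shows the byte counts, not the actual token boundaries, which depend on the merges the tokenizer learned.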

