Personally I don't find UTF-16 to be too bad. It's a simple encoding and very ea...

jeltz · on April 8, 2021

The bad thing about UTF-16 is that so much software assumes that a code point is always 16 bits.

jrochkind1 · on April 8, 2021

Which is not UTF-16 at all; the UTF-16 standard clearly says this is not so. So why do they do that?

It's actually a leftover from the earlier UCS-2 standard, from before it was realized that we'd need more than 65,536 codepoints and that it was a mistake to limit codepoints to a 16-bit space in any encoding.

Software written for UCS-2 can mostly work compatibly with UTF-16, but there are some problems; encoding the 'higher' codepoints (those above U+FFFF) is only one of several. Another is how right-to-left scripts are handled.
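
To make the 'higher' codepoints point concrete, here is a minimal Python sketch. The helper name utf16_code_units is made up for illustration and the two sample characters are arbitrary picks. It shows that a codepoint inside the 16-bit range takes one UTF-16 code unit, while one above U+FFFF takes two (a surrogate pair), which is exactly what UCS-2-era code does not expect.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of `text` encoded as UTF-16.

    Hypothetical helper for illustration; big-endian without a BOM is
    chosen only so the units are easy to read.
    """
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp = "\u20ac"         # EURO SIGN, U+20AC, fits in 16 bits
astral = "\U0001F600"  # GRINNING FACE, U+1F600, does not fit in 16 bits

print([hex(u) for u in utf16_code_units(bmp)])     # ['0x20ac']
print([hex(u) for u in utf16_code_units(astral)])  # ['0xd83d', '0xde00']

# Python strings count code points, so both are length 1 here, but code that
# indexes by 16-bit unit would see the second string as two "characters".
print(len(bmp), len(astral))  # 1 1
```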

http://www.differencebetween.net/technology/software-technol...

https://unicode.org/faq/utf_bom.html#utf16-11

flohofwoe · on April 8, 2021

Wasn't UTF-16 explicitly created as a "backward compatibility hack" for UCS-2 when it became clear that 16 bits per code point isn't enough? They should have ditched 16-bit encodings back then instead of combining the disadvantages of UTF-8 (variable-length encoding) and UTF-32 (not endian-agnostic).
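
A rough Python sketch of both disadvantages, using only the standard codecs. The helper name encoded_sizes and the sample characters are assumptions for illustration. It prints how many bytes each character needs per encoding, and shows that UTF-16 bytes depend on byte order while UTF-8 bytes do not.

```python
def encoded_sizes(ch):
    """Byte length of `ch` in each encoding (hypothetical helper, no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# Arbitrary samples: ASCII, Latin-1 range, BMP, and beyond the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Variable length: UTF-16 needs 2 bytes for "A" but 4 for U+1F600.
# Byte order: the same character gives different bytes in LE and BE.
print("A".encode("utf-16-le").hex())  # 4100
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-8").hex())      # 41
```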

jrochkind1 · on April 8, 2021

Perhaps Unicode wouldn't have been nearly as successfully adopted as it is if they had left UCS-2 adopters hanging instead of providing them a "backward compatibility hack" path.

The UCS-2 adopters, after all, had been faithfully trying to implement the standard as it was at that time. Among other things, showing implementers that you won't leave them hanging out to dry when you realize you made a mistake in the standard gives other people more confidence to adopt.

But also, more generally, I think a lesson of Unicode's success -- as illustrated by UTF-8 in particular -- is that you have to give people a feasible path from where they are to adoption; this is a legitimate part of the design goals of a standard.

cygx · on April 8, 2021

Most of the hard stuff is there no matter the encoding (normalization, user-perceived characters spanning multiple code units, paths vs strings, ...).
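
As a small illustration of the normalization and multi-code-unit points, here is a Python sketch using the standard unicodedata module. The helper name code_unit_counts is made up for this example. It takes "é" written as "e" plus a combining accent and shows that this one user-perceived character spans more than one code unit in UTF-8, UTF-16, and UTF-32 alike, and that it only compares equal to the precomposed form after normalization.

```python
import unicodedata


def code_unit_counts(text):
    """Number of code units `text` occupies per encoding (hypothetical helper)."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }


composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Same user-perceived character, but more than one unit in every encoding.
print(code_unit_counts(decomposed))  # {'utf-8': 3, 'utf-16': 2, 'utf-32': 2}

# The two spellings differ until they are normalized to the same form.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```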