Interesting comment near the end: "2. The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these."
If that had happened, I guess emojis as we know them today might never have existed, since it would have limited us to 16 bits of code points. Or we would have had to start doing surrogate pairs even in UTF-8. Close call.
> UTF-8 is just a way to encode, it doesn't decide what goes into Unicode.
What the person you're replying to means is that if UTF-8 had not had sequences longer than 3 bytes, it could not have expressed code points as high as the emoji range, which would certainly have hampered emoji adoption.
While your conclusion is largely correct, it doesn't follow from your premises: UTF-16 is just a way to encode, but its brain-damaged surrogate pair mechanism very much did get baked into Unicode (namely, high and low surrogate code points D800-DFFF).
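To make that concrete, here's a quick Python sketch (the error message is CPython's, but any strict encoder behaves the same way): because the surrogate range is reserved in Unicode itself, a lone surrogate is unencodable even in UTF-8, which never needed surrogates at all:

    # The surrogate block U+D800..U+DFFF is carved out of Unicode itself,
    # so a strict UTF-8 encoder must reject a lone surrogate code point.
    try:
        chr(0xD800).encode('utf-8')
    except UnicodeEncodeError as err:
        print(err)  # "'utf-8' codec can't encode character ... surrogates not allowed"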
The 5-6 byte variants (and also 4 at the time) exist because of the need to round-trip UCS surrogate pairs through UTF-8, no? That's what I assume the "political reasons" are...
Others have already answered why surrogate pairs are irrelevant here (and that UCS isn't the reason), but I think it's worth saying what the probable actual reason for the 5-6 byte variants was. Remember that UCS and Unicode were at that point still two separate things; Unicode was supposed to be 16-bit (and later got expanded, causing the whole surrogates mess), while UCS was supposed to be 31-bit. I assume the 5-6 byte variants were for UCS (back before it got merged with Unicode).
Surrogate pairs are only in UTF-16 so as to encode code points that require more than 16 bits. UTF-8 has no need of them because it's already a variable width encoding.
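A quick Python sketch to illustrate (U+1F600 is just a convenient example of a code point above U+FFFF): UTF-16 has to split it into a surrogate pair, while UTF-8 simply spends one more byte:

    # U+1F600 doesn't fit in a single 16-bit UTF-16 code unit,
    # so UTF-16 splits it into a high/low surrogate pair.
    cp = 0x1F600
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # 0xD83D
    low = 0xDC00 + (v & 0x3FF)   # 0xDE00
    assert (high, low) == (0xD83D, 0xDE00)
    assert chr(cp).encode('utf-16-be') == b'\xd8\x3d\xde\x00'

    # UTF-8 needs no such mechanism: it just uses a longer sequence.
    assert chr(cp).encode('utf-8') == b'\xf0\x9f\x98\x80'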
If there were no code points larger than 16 bits then UTF-8 would only need a maximum of 3 bytes per code point and UTF-16 wouldn't need surrogate pairs. Well actually UTF-16 probably wouldn't exist at all because UCS-2 would have been enough for everybody.
No. They exist to encode a 31-bit code point space, but the Unicode Consortium later decided to limit the code point space to 21 bits, because that is what UTF-16 is limited to, and at that point UTF-8 no longer needed to support sequences of 5 and 6 bytes.
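For anyone curious, a small Python sketch of the two regimes (limits per the original RFC 2279 layout; the modern RFC 3629 spec caps sequences at 4 bytes and code points at U+10FFFF):

    # Sequence length under the original 1-6 byte UTF-8 scheme:
    # each extra byte adds capacity, up to the 31-bit ceiling.
    def utf8_len_original(cp):
        limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
        for nbytes, limit in enumerate(limits, 1):
            if cp < limit:
                return nbytes

    assert utf8_len_original(0x10FFFF) == 4     # today's highest code point
    assert utf8_len_original(0x7FFFFFFF) == 6   # highest in the 31-bit scheme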
I don't think so? Aren't UCS surrogate pairs at most 16-bit each by their very purpose? Also, >16-bit Unicode code points came much later, I believe, in Unicode 2.0 in 1996 according to Wikipedia (vs UTF-8, which is from around 1992).