Interesting comment near the end: "2. The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these."
If that had happened, I guess emojis as we know them today might never have existed, since it would have limited us to 16 bits of code points. Or we would have had to start doing surrogate pairs even in UTF-8. Close call.
> UTF-8 is just a way to encode, it doesn't decide what goes into Unicode.
What the person you're replying to means is that if UTF-8 had not had sequences longer than 3 bytes, it could not have expressed code points as high as the emoji range, which would certainly have hampered emoji adoption.
While your conclusion is largely correct, it doesn't follow from your premises: UTF-16 is just a way to encode, but its brain-damaged surrogate pair mechanism very much did get baked into Unicode (namely, high and low surrogate code points D800-DFFF).
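To make that concrete, here's a quick Python sketch (the error message is CPython's, but any strict encoder behaves the same way): because the surrogate range is reserved in Unicode itself, a lone surrogate is unencodable even in UTF-8, which never needed surrogates at all:

    # The surrogate block U+D800..U+DFFF is carved out of Unicode itself,
    # so a strict UTF-8 encoder must reject a lone surrogate code point.
    try:
        chr(0xD800).encode('utf-8')
    except UnicodeEncodeError as err:
        print(err)  # "'utf-8' codec can't encode character ... surrogates not allowed"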
The 5-6 byte variants (and also 4 at the time) exist because of the need to round-trip UCS surrogate pairs through UTF-8, no? That's what I assume the "political reasons" are...
Others have already answered why surrogate pairs are irrelevant here (and that UCS isn't the reason), but I think it's worth saying what the probable actual reason for the 5-6 byte variants was. Remember that UCS and Unicode were at that point still two separate things; Unicode was supposed to be 16-bit (and later got expanded, causing the whole surrogates mess), while UCS was supposed to be 31-bit. I assume the 5-6 byte variants were for UCS (back before it got merged with Unicode).
Surrogate pairs are only in UTF-16 so as to encode code points that require more than 16 bits. UTF-8 has no need of them because it's already a variable width encoding.
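A quick Python sketch to illustrate (U+1F600 is just a convenient example of a code point above U+FFFF): UTF-16 has to split it into a surrogate pair, while UTF-8 simply spends one more byte:

    # U+1F600 doesn't fit in a single 16-bit UTF-16 code unit,
    # so UTF-16 splits it into a high/low surrogate pair.
    cp = 0x1F600
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # 0xD83D
    low = 0xDC00 + (v & 0x3FF)   # 0xDE00
    assert (high, low) == (0xD83D, 0xDE00)
    assert chr(cp).encode('utf-16-be') == b'\xd8\x3d\xde\x00'

    # UTF-8 needs no such mechanism: it just uses a longer sequence.
    assert chr(cp).encode('utf-8') == b'\xf0\x9f\x98\x80'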
If there were no code points larger than 16 bits then UTF-8 would only need a maximum of 3 bytes per code point and UTF-16 wouldn't need surrogate pairs. Well actually UTF-16 probably wouldn't exist at all because UCS-2 would have been enough for everybody.
No. They exist to encode a 31-bit code point space, but the Unicode Consortium later decided to limit the code point space to 21 bits, because that is what UTF-16 is limited to, and at that point UTF-8 no longer needed to support sequences of 5 and 6 bytes.
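For anyone curious, a small Python sketch of the two regimes (limits per the original RFC 2279 layout; the modern RFC 3629 spec caps sequences at 4 bytes and code points at U+10FFFF):

    # Sequence length under the original 1-6 byte UTF-8 scheme:
    # each extra byte adds capacity, up to the 31-bit ceiling.
    def utf8_len_original(cp):
        limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
        for nbytes, limit in enumerate(limits, 1):
            if cp < limit:
                return nbytes

    assert utf8_len_original(0x10FFFF) == 4     # today's highest code point
    assert utf8_len_original(0x7FFFFFFF) == 6   # highest in the 31-bit scheme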
I don't think so? Aren't UCS surrogate pairs at most 16-bit each by their very purpose? Also, >16-bit Unicode code points came much later, I believe, in Unicode 2.0 in 1996 according to Wikipedia (vs UTF-8, which is from around 1992).