So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 2 billion characters (31 bits), which turned out to exceed the actual requirement by nearly 2000x.
September 1992: 2 guys scribbling on a placemat.
January 1998: RFC 2279 defines UTF-8 as 1 to 6 bytes.
March 2001: A bunch of CJK characters were added in Unicode 3.1.0, pushing the total to 94,140, exceeding the 16-bit limit of 3-byte UTF-8.
November 2003: RFC 3629 defines UTF-8 as 1 to 4 bytes.
Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.
EDIT: Just noticed this in the footnotes, and the plot thickens...
> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.
This is also a very simple use of the idea of a "prefix-free code" from information theory and coding (the set of codes {0,10,110,1110,11110,111110,1111110} is prefix-free).
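To make the prefix-free point concrete, here is a minimal Python sketch that classifies a single byte by its high bits, using the lead-byte prefixes of the original 1-to-6-byte layout from RFC 2279 (0, 10, 110, ..., 1111110). The names `PREFIXES` and `classify_byte` are made up for this example and are not from the linked email.

```python
# Hypothetical helper for illustration only.
# Each entry: (mask selecting the prefix bits, expected prefix value, meaning).
# Meaning is the sequence length a lead byte announces, or 0 for a continuation byte.
PREFIXES = [
    (0b10000000, 0b00000000, 1),  # 0xxxxxxx: single byte (ASCII)
    (0b11000000, 0b10000000, 0),  # 10xxxxxx: continuation byte
    (0b11100000, 0b11000000, 2),  # 110xxxxx: lead byte of a 2-byte sequence
    (0b11110000, 0b11100000, 3),  # 1110xxxx: lead byte of a 3-byte sequence
    (0b11111000, 0b11110000, 4),  # 11110xxx: lead byte of a 4-byte sequence
    (0b11111100, 0b11111000, 5),  # 111110xx: 5-byte lead (RFC 2279 only)
    (0b11111110, 0b11111100, 6),  # 1111110x: 6-byte lead (RFC 2279 only)
]


def classify_byte(b: int) -> int:
    """Return the announced sequence length, 0 for a continuation byte,
    or -1 if the byte matches none of the prefixes (0xFE and 0xFF)."""
    for mask, value, meaning in PREFIXES:
        if b & mask == value:
            return meaning
    return -1
```

Because no prefix is the start of another, at most one row can match any given byte, so the order of the table does not matter. For example, `classify_byte(0xE6)` reports a 3-byte lead and `classify_byte(0x97)` reports a continuation byte.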
I think there's also the idea that the decoder can "sync up" when it, say, starts in the middle of a character.
The encoding scheme is laid out in the linked email. Based on the high bits it's possible to detect when a new character starts. Relevant portion:
[...] If you are starting mid-run, skip initial Tx bytes. That will always be less than one character.
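A rough Python sketch of that resync rule, assuming "Tx bytes" means continuation bytes of the form 10xxxxxx. The function name `resync` is hypothetical, and this is an illustration of the idea rather than the code from the email.

```python
def resync(data: bytes, start: int = 0) -> int:
    """Return the index of the first byte at or after `start` that is not a
    continuation byte, i.e. where the next character begins.
    Returns len(data) if only continuation bytes remain."""
    i = start
    # Continuation bytes look like 10xxxxxx: top two bits are 1 and 0.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i


# "é" encodes to C3 A9 and "日" to E6 97 A5.
data = "é日".encode("utf-8")
# Start in the middle of "é" (index 1) and skip forward to the next character.
pos = resync(data, 1)
recovered = data[pos:].decode("utf-8")
```

Here `pos` ends up pointing at the lead byte of "日", so the tail decodes cleanly. The skipped bytes are always a fragment of a single character, which matches the "less than one character" remark in the quote.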