Note that it says less than one *character*. A character in UTF-8 can be compose...

ChrisSD · on April 8, 2021

Note that UTF-8 has since been restricted to at most 4 bytes (i.e. the longest sequence is `T3 Tx Tx Tx`).

throwdbaaway · on April 8, 2021

So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 4 billions characters, which turned out to exceed the actual requirement by more than 2000x.

September 1992: 2 guys scribbling on a placemat.

January 1998: RFC 2279 defines UTF-8 to be between 1 to 6 bytes.

March 2001: A bunch of CJK characters were added to Unicode Data 3.1.0, pushing the total to 94,140, exceeding the 16-bit limit of 3 bytes UTF-8.

March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)

September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reason (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)

November 2003: RFC 3629 defines UTF-8 to be between 1 to 4 bytes.

Arguably, if the placemat was smaller and the guys stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still happen.

EDIT: Just notice this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!

sn41 · on April 8, 2021

This is also a very simple form of using the idea of a "prefix-free code" from information theory and coding. (the codes {0,10,110,1110,11110,...,111111} is a prefix-free set).

I think there's also the idea that the code can "sync up" when it say, starts in the middle of a character.