
Note that it says less than one character. A character in UTF-8 can be composed of multiple bytes.

The encoding scheme is laid out in the linked email. Based on the high bits it's possible to detect when a new character starts. Relevant portion:

  We define 7 byte types:
  T0 0xxxxxxx      7 free bits
  Tx 10xxxxxx      6 free bits
  T1 110xxxxx      5 free bits
  T2 1110xxxx      4 free bits
  T3 11110xxx      3 free bits
  T4 111110xx      2 free bits
  T5 111111xx      2 free bits

  Encoding is as follows.
  >From hex Thru hex      Sequence             Bits
  00000000  0000007f      T0                   7
  00000080  000007FF      T1 Tx                11
  00000800  0000FFFF      T2 Tx Tx             16
  00010000  001FFFFF      T3 Tx Tx Tx          21
  00200000  03FFFFFF      T4 Tx Tx Tx Tx       26
  04000000  FFFFFFFF      T5 Tx Tx Tx Tx Tx    32
[...]

  4. All of the sequences synchronize on any byte that is not a Tx byte.
If you are starting mid-run, skip initial Tx bytes. That will always be less than one character.
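A minimal sketch of that resynchronization rule, assuming the input is a byte buffer that may begin partway through a character. The function name `skip_to_char_start` is made up for illustration and does not come from the email:

```python
def skip_to_char_start(buf: bytes) -> int:
    """Return the index of the first byte that is not a Tx (10xxxxxx) byte."""
    i = 0
    # A Tx (continuation) byte has its top two bits equal to 10.
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i


data = "a\u20acb".encode("utf-8")   # b'a\xe2\x82\xacb', the euro sign takes 3 bytes
tail = data[2:]                      # starts inside the 3-byte sequence
start = skip_to_char_start(tail)     # skips the two leftover Tx bytes
print(tail[start:].decode("utf-8"))  # prints "b"
```

The skipped bytes are always a partial character, which is why the loss is "less than one character".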


Note that UTF-8 has since been restricted to at most 4 bytes (i.e. the longest sequence is `T3 Tx Tx Tx`).
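For illustration, here is a rough sketch of an encoder that follows only the first four rows of the quoted table (T0 through `T3 Tx Tx Tx`). The name `encode_utf8` is hypothetical. It follows the table's ranges as written, so unlike RFC 3629 it does not reject surrogates or values above U+10FFFF:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-byte rows of the table."""
    if cp < 0x80:                       # T0: 7 bits
        return bytes([cp])
    if cp < 0x800:                      # T1 Tx: 5 + 6 = 11 bits
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # T2 Tx Tx: 4 + 6 + 6 = 16 bits
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x200000:                   # T3 Tx Tx Tx: 3 + 6 + 6 + 6 = 21 bits
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("needs a 5- or 6-byte sequence, which this sketch omits")


# Cross-check against Python's built-in encoder for one code point per row.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```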


So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 4 billion characters, which turned out to exceed the actual requirement by more than 2000x.

September 1992: 2 guys scribbling on a placemat.

January 1998: RFC 2279 defines UTF-8 to be between 1 and 6 bytes.

March 2001: A bunch of CJK characters were added in Unicode 3.1.0, pushing the total to 94,140, exceeding the 16-bit limit of 3-byte UTF-8.

March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)

September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reasons (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)

November 2003: RFC 3629 defines UTF-8 to be between 1 and 4 bytes.

Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.

EDIT: Just noticed this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!


This is also a very simple application of the idea of a "prefix-free code" from information theory and coding (the set of codes {0,10,110,1110,11110,...,111111} is prefix-free).

I think there's also the idea that the code can "sync up" when it, say, starts in the middle of a character.
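The prefix-free property is easy to check by brute force. This is just a sketch: the bit-string list below is written out by hand from the seven byte types in the email (including the Tx prefix `10`), and `is_prefix_free` is a made-up helper name:

```python
from itertools import permutations

# High-bit prefixes of the seven byte types: T0, Tx, T1, T2, T3, T4, T5.
LEAD_PREFIXES = ["0", "10", "110", "1110", "11110", "111110", "111111"]


def is_prefix_free(codes):
    """True if no code in the list is a prefix of a different code."""
    return not any(b.startswith(a) for a, b in permutations(codes, 2))


print(is_prefix_free(LEAD_PREFIXES))  # prints True
print(is_prefix_free(["0", "01"]))    # prints False, since "0" is a prefix of "01"
```

Because no prefix is the start of another, the high bits of any single byte identify its type unambiguously, which is what makes resynchronizing possible.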



