I think we should take a moment to appreciate how great UTF-8 is, and how well it worked out. It's easy to get disillusioned with internet standards when IPv6 is taking forever and messaging is all proprietary locked-down protocols. Yet character encodings used to be a horrible mess, and now they're not. In the 90s the only practical solution was for everyone to use the same OS, same word processor, same web browser, and who cares about talking with foreigners anyway?
I don't think it was always guaranteed to turn out well. China and Japan could have stayed with their own encodings. Microsoft and Apple could have done incompatible things. The tech world is full of bad things we're stuck with because there's no way to coordinate a change.
Unicode has its flaws, UTF-16 is still lurking here and there, everyone loves to argue about emoji, but overall text just works now.
One little feature I like in particular is that if you're looking for an ASCII-7 character in a UTF-8 stream -- say, an LF or comma -- you don't have to decode the stream first, because all bytes in the encoding of non-ASCII-7 characters have the high bit set. Or as Wikipedia puts it:
> Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.
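That property is easy to check for yourself. A small Python sketch (my own illustration, not from the thread): search raw UTF-8 bytes for a comma without decoding, and confirm that no byte of a multi-byte sequence could ever be mistaken for it.

```python
# Every byte of a multi-byte UTF-8 sequence is in the range 0x80-0xFF,
# so a raw byte search for an ASCII character can never hit a false
# positive inside a multi-byte sequence.
data = "naïve, café, 渋谷".encode("utf-8")

# Find every comma by scanning raw bytes -- no decoding needed.
comma_positions = [i for i, b in enumerate(data) if b == ord(",")]
print(comma_positions)  # byte offsets of the two commas

# Sanity check: no byte belonging to a non-ASCII character is below 0x80.
multibyte_bytes = "ïé渋谷".encode("utf-8")
assert all(b >= 0x80 for b in multibyte_bytes)
```

This is exactly why tools like `grep`, CSV splitters, and path parsers can treat UTF-8 as opaque bytes and still be correct.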
It's amazing to hear they put it together in one night at a diner! :-D
> It's amazing to hear they put it together in one night at a diner! :-D
I guess you're saying that in good humor. But I'll add this because it makes me appreciate how these things happen:
> What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it.
"We hated it" -- there is just so much going on in those 3 words. They could have been suffering with the previous state for a year for all we know. And even if not, to know you hate something just takes a lot of system building experience to get to. And then when opportunity struck they probably already had a laundry list of grievances they had built up over that time and were ready to pounce.
If they hadn't had on-the-ground experience with the Plan 9 version, and been able to see from that actual experience which parts of it they wanted to keep and which parts needed to be done differently...
Often you can't build the polished thing until you have experienced the thing before.
Lately I get discouraged that there seems to be so little attention paid to "prior art" in software development; building on it is the only way to make progress!
While the design is nice, it doesn't seem -that- earthshattering that it was done in four days. Once you make the realization that 'wait, ASCII only needs the lower 7 bits, let's work off that', it's all just details past that.
Don't get me wrong, I love UTF-8 and it is well thought out and designed. But the end result is not so complicated, so much so that pretty much anyone reading the rules could understand it.
I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems. Today's 'amazing' things would involve image recognition or processing, self driving cars, better ML/AI algos. Things that are hard to impossible to be done by a guy or two over the weekend.
Sadly, as a result, I think we'll have fewer 'programming heroes' than existed in previous decades.
> While the design is nice, it doesn't seem -that- earthshattering that it was done in four days.
And yet it may have needed a genius to design and write something so simple. UTF-8 was not the first multilingual encoding system; here's an entire list of them, worked on by a lot of probably very smart people:
Edit: A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. — Antoine de Saint-Exupéry
>I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems.
git was 2005, and that was probably similarly impactful in the version control space (in that it was much closer to fundamentally correct, than its predecessors). And there are quite a few standards out there that only survive by virtue of already having been established -- not because they meet any reasonable bar of quality. IPv4 (and all the grand schemes to work around the terror of NAT), email (the worst communication system, except for all the others), SQL (the language specifically -- a mishmash of keywords with almost no ability to properly compose), etc.
The bigger difference I think between the 90's and now is that it was probably much easier to make your new superior standard actually be used -- you could implement a new kernel today which was fantastically superior to linux, and you're much more likely than not to get zero traction (ex: plan9) simply by virtue of how well-entrenched linux already is.
Given that Torvalds apparently went from design to implementation in 3 days, and 2 months later had it officially managing the kernel, I wouldn’t say it was particularly high-hanging.
Yeah, this is great! I came across that recently when working on a parser in Zig, which treats strings as arrays of bytes. I didn't know much about UTF-8 other than that it's scary and programmers mess up text processing all the time. I was worried that a multi-byte code point could trick my simple char switch, which was looking for certain ASCII characters. But then I came across that bit you quoted and was both surprised and relieved!
Then, when I needed to minimally handle non-ASCII characters I found Zig's minimal unicode helper library and saw what I was looking for in a small function that takes a leading byte and returns how many bytes there are in the codepoint. I was impressed with the spec again!
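The trick that helper relies on is that the leading byte's top bits encode the sequence length. A rough Python equivalent (my own sketch, not the actual Zig code):

```python
def utf8_sequence_length(leading_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with this byte has."""
    if leading_byte < 0x80:          # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if leading_byte & 0xE0 == 0xC0:  # 110xxxxx: 2-byte sequence
        return 2
    if leading_byte & 0xF0 == 0xE0:  # 1110xxxx: 3-byte sequence
        return 3
    if leading_byte & 0xF8 == 0xF0:  # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a leading byte (continuation byte or invalid)")

assert utf8_sequence_length("a".encode("utf-8")[0]) == 1
assert utf8_sequence_length("é".encode("utf-8")[0]) == 2
assert utf8_sequence_length("渋".encode("utf-8")[0]) == 3
assert utf8_sequence_length("😀".encode("utf-8")[0]) == 4
```

So you can skip over whole codepoints without ever looking at the continuation bytes, which is what makes simple byte-level parsers safe.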
I wonder how many pieces of computing technology used today were put together in a single evening by a team of motivated developers. Rubygems, for example, was written in a couple of hours at the back of a hotel bar, then demoed (complete with network install and versioning) at Rubyconf the following morning.
As I age, I'm starting to believe that the best technology is often built this way, rather than stewing for years in an ISO subcommittee. Limited development time can lead to features that provide the greatest value for the time spent.
> It's amazing to hear they put it together in one night at a diner! :-D
I will bet that he had half-formed ideas of how it could work from the previous pain with the "original UTF". The best people I work with are constantly looking at things that are wrong and coming up with ideas for how they could be better, even if 99% of them will never be used.
> China and Japan could have stayed with their own encodings.
Absolutely correct. There was a big debate in Japan in the 1990s about character encodings, with some people arguing strongly against the adoption of Unicode. Their main argument, as I remember it, was that Unicode didn’t capture all of the variations in kanji, especially for personal names.
For those of us who were trying to use Japanese online at the time, though, those arguments seemed beside the point. While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe [1], we were faced with the daily frustration of trying to convert between JIS, S-JIS, EUC, and other encodings and often not being able to exchange Japanese text at all with people who hadn’t installed special software on their computers. It was a great relief when UTF-8 became adopted universally.
Tell that to my coworkers. I still get emails encoded in SJIS every day, sometimes with attachments with the file name also encoded in SJIS, which results in funny mojibake when saving them to disk. Not to mention the many web forms that insist you need to write your name in full-width characters or whatever funky shit.
On the other hand, I recently got some Python scripts to crash because someone on the European team decided to encode some texts in ISO-8859-1, and Python assumes everything is UTF-8.
I really, really wish one day all legacy encodings will disappear from the face of the Earth and only UTF-8 will stay.
Not to mention that the Linux unzip utility doesn't have a way to handle Shift-JIS filenames, or really any filename encodings besides UTF-8. You have to use an entirely different program like unzip-jp just for those files, in order to not be left with dozens of unintelligible folder names.
There's a reason the underground community calls it "shit-jizz."
On that issue, infsp6 (the Spanish library for Inform 6, akin to the English inform6lib) still uses ISO-8859-15, and it's a pain in the ass to convert the encoding to and from UTF-8 unless you use emacs, joe, or vim to edit the source code (I use nvi).
Thunderbird will display SJIS emails just fine. The problem with attachments is when some adds a ZIP with SJIS filenames, but then it's not Thunderbird's problem but whatever tool you use to decompress it.
Regarding Python, the default behaviour when decoding an invalid UTF-8 string is to raise an exception. But your comment made me research it, and I just found that there is a way to replace invalid bytes with U+FFFD, so I will try it.
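For anyone else hitting the same crash, here's a small sketch of both behaviours, using ISO-8859-1 bytes as the "bad" input:

```python
# ISO-8859-1 bytes for "café" -- é is the single byte 0xE9, which is
# not valid on its own in UTF-8.
data = "café".encode("iso-8859-1")

# Default: strict decoding raises UnicodeDecodeError.
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# Alternative: replace invalid bytes with U+FFFD (the replacement character).
lenient = data.decode("utf-8", errors="replace")

assert raised
assert lenient == "caf\ufffd"
```

There's also `errors="ignore"` to drop the bad bytes silently, but the U+FFFD marker at least leaves evidence that something was mangled.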
Han unification was definitely a mistake. To this day people in different countries will use different fonts so that text looks how it is supposed to in their language.
The promise of unicode was that you can losslessly convert any encoding to unicode. However, because of the failed attempt at Han unification, some important information can be lost.
Exactly, UTF-8 is great, except this part. In the name of unification you destroyed the culture that was there. But modern days people are fine with it as long as it is not their culture. And no, adding fonts or notation doesn't solve the problem. I remember I read a very nicely put analogy on HN years ago.
The link describes the problems but _not_ having a unified CJK table results in other problems. I regularly read texts in both Simplified and Traditional Chinese and some Japanese, and it's nice that at least characters that are uncontroversially identical have the same code point, so that what I type in my preferred input method could be used interchangeably (think eg. text searching).
There's enough trouble already with having a "handful" of commonly seen character variants between SC, TC and JP, making things like copyediting text a pain if you're not well versed in unicode shenanigans. I am involved in maintaining a Cantonese dictionary, and inadvertent duplicate entries occur very often because of these variants (yes, people sometimes, somehow, manage to input characters outside of the supposed locale -- and never notice because they look very similar).
Fundamentally the problem is that European notions of "character" break down in East Asian languages. Not sure whether this is anyone's fault, and if we designed it again today with hindsight and improved technology (font rendering, text searching/processing, etc.) we might have decided to construct Han "characters" from radicals instead. But in the 90s there might not have been a better choice.
PS: not to mention the political fallout with having a separate table for Mainland-Chinese and Taiwan-Chinese if that were a policy. It's like having an English "a" and an American "a". Should Wales, Scotland and Northern Ireland have a different 'a' then? Why shouldn't the 50 states of USA each have a different 'a'. It'd bring the politics of the United Nations into Unicode.
Han unification isn't the main blocker to the transition to UTF-8 in Japan. Just use a Japanese font in that context.
The reason SJIS is still used is that there are many legacy systems, and developers who still think SJIS is fine. We tend not to handle other languages, so it mostly works (emoji aside).
SJIS is sometimes useful because a 1-byte char is half-width and a 2-byte char is full-width by design. Old developers still refer to Japanese characters as "2-byte characters" even when the system is UTF-8.
Han unification is a bullshit excuse. Is two story 'a' a different letter than one story? Is seven with a slash through it different than seven without? Is Japanese as written in pre-war books a different language than Japanese in post-war books?
Unicode may have dropped a couple of variants, but they basically all got added back. There's no problem with Han unification; there's just a FUD campaign powered by nationalism and ignorance that is used to justify everyday technological inertia.
> While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe [1],...
Unicode has gotten so big, isn’t this included by now?
Also see the IVD [1]. Indeed both 邉 (U+9089) and 邊 (U+908A) are exceptionally variable characters, the first having 32 variation sequences (the record as of 2020-11-06) and the second having 21 variation sequences.
Either hiragana or katakana is used on most official documents for convenience. On the family registers (戸籍 koseki), which are perhaps the most important, though, the readings of names are not listed. For people whose names are written only in kanji, those kanji, and not the readings, are the legal versions of their names.
As someone currently stuck in the windows world, this hurts. Every single Windows API is still stuck with using UTF-16/UCS2 as the string encoding.
Also fun fact, on the Nintendo Switch, various subsystems use different kinds of encoding. The filesystem submodule uses Shift-JIS, most of the other modules use UTF-8, but some others yet use UTF-16 (like the virtual keyboard, IIRC). A brilliant mess.
Conversion functions - MultiByteToWideChar & co. - were in since Windows 2000 and the UTF8 codepage was supported as early as XP if not in W2K as well.
It existed in W2K and maybe even earlier, but there were bugs in the console regarding codepage 65001, so you couldn't use it as the default. This was not fixed yet in XP, maybe in 7 though.
Ah thanks! It's funny because I can't recall ever using code page 65001 before 7. Maybe there was a reason for that or maybe I simply didn't know it existed until then. Or maybe I thought it simpler to just use UTF-16. I can't remember.
Windows does allow setting it in the application's manifest but it also requires a registry setting to be enabled otherwise the manifest option is ignored. Obviously asking users to edit the registry is a non-starter so it's only used where the developers also control the user environment (e.g. the registry change is deployed through group policies, etc).
But yeah, this just tells Windows to do the conversion so that programmers don't have to type out the function calls themselves. It's simple enough to create a wrapper function for Windows API calls in any case.
Java is still using UTF-16, it is the internal format used since its creation. I don't know exactly how much this is a problem or not, but it shows that UTF-16 is still an important thing.
I think it's a huge problem for Java. Try doing proper string collation (standard library or ICU4J), or regular expression matching, in a context where your strings are all UTF-8 and your output should also be UTF-8. Operations that shouldn't require allocation do, because you have to transcode to UTF-16. Not to mention that in some cases, that transcoding is the most expensive part of the operation.
All the core Java APIs are built around String or CharSequence (more the latter in releases post-Java 8). CharSequence is a terrible interface for supporting UTF-8 or any encoding besides latin1 or UTF-16. If Java's interfaces had been designed around Unicode codepoint iteration rather than char random access, then the coupling to UTF-16 wouldn't have been so tight. But as things stand, you aren't doing anything interesting to text in Java without either (1) re-implementing everything from scratch, from integer parsing to regexp, or (2) paying the transcode cost on everything your program consumes and emits.
Personally I don't find UTF-16 to be too bad. It's a simple encoding and very easy to convert to/from UTF-8. So your program can be written in UTF-8 and your WinAPI wrappers can convert as/when needed.
Which is not UTF-16 at all; the UTF-16 standard clearly says this is not allowed. So why do they do that?
It's actually a leftover of the earlier UCS-2 standard, before it was realized we'd need more codepoints than that, and that it was a mistake to limit to 16-bit space for codepoints in any encoding.
Software written for UCS-2 can mostly work compatibly with UTF-16, but there are some problems, encoding the 'higher' codepoints is only one of several. Another is how right-to-left scripts are handled.
Wasn't UTF-16 explicitly created as a "backward compatibility hack" for UCS-2 when it became clear that 16 bits per code point isn't enough? They should have ditched 16-bit encodings back then instead of combining the disadvantages of UTF-8 (variable-length encoding) and UTF-32 (not endian-agnostic).
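For context, the hack works by reserving two 16-bit ranges (the surrogates) that pair up to encode codepoints above U+FFFF. A quick sketch of the arithmetic (in Python, cross-checked against its own UTF-16 encoder):

```python
import struct

def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a codepoint above U+FFFF as a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    v = cp - 0x10000                # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)       # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (v & 0x3FF)      # low (trail) surrogate: bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)  # 😀
assert (high, low) == (0xD83D, 0xDE00)

# Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
assert "😀".encode("utf-16-be") == struct.pack(">HH", high, low)
```

So a UCS-2 program can shuttle the pair around as two "characters" without understanding them, which is exactly the compatibility path being discussed, and exactly why string lengths and indexing get subtly wrong.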
Perhaps unicode wouldn't be nearly as successfully adopted as it is, if they had left UCS-2 adopters hanging instead of providing them a "backward compatibility hack" path.
The UCS-2 adopters after all had been faithfully trying to implement the standard at that time. Among other things, showing implementers that if they choose to adopt, you aren't going to leave them hanging out to dry when you realize you made a mistake in the standard, will give other people more confidence to adopt.
But also, just generally I think a lesson of unicode's success -- as illustrated by UTF-8 in particular -- is, you have to give people a feasible path from where they are to adoption, this is a legitimate part of the design goals of a standard.
Unicode often gets a lot of online hate, which frustrates me, as I agree with you -- Unicode in general is a remarkably successful standard, technically as well as with regard to adoption.
Its adoption success isn't a coincidence; it's a result of choices made in the design -- with UTF-8 being a big part of that. The choices sometimes involve trade-offs, which lead to the things people complain about (say, the two different codepoint arrangements that can represent é -- there's a reason for that, again related to easing the on-ramp to Unicode from legacy technologies, one of the main goals of UTF-8).
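Concretely, the two arrangements for é are a precomposed codepoint (U+00E9, which round-trips cleanly from Latin-1) and a base letter plus combining accent (U+0065 U+0301). A small illustration with the standard library:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single codepoint (NFC form)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (NFD form)

# They render identically but compare unequal as raw strings...
assert precomposed != decomposed

# ...until normalized to a common form.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Normalizing before comparison is the standard answer to this class of complaint.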
There are always trade-offs, nothing is perfect. But Unicode sometimes seems to me to be almost the optimal balance of all the different concerns, I think they could hardly have done better!
The "UCS-2 => UTF-16" mis-step was unfortunate, and we are still dealing with some of the consequences (Java/Windows)... but the fact that we made it through with Unicode adoption only continuing to grow is a testament to Unicode's good design again.
It's not until I ran into some of the "backwaters" of Unicode, realizing they had thought out and specified how to do things like "case-insensitive" normalized collation/comparison for a variety of different specifications in a localized and reasonably performant way...
Hate against Unicode frustrates me too. It's like people are privileged and don't know what it's like to have at least 2 code pages for your language. I _still_ have to fix files from those encodings and back into Unicode. Unicode is a blessing even if many don't see it.
It's the same with html and css: people shit on it all the time, but this just shows they don't have the imagination to see how much worse it could be.
Sort of related: I learned from reading about Facebook's lack of moderation that Myanmar is one of the few countries that doesn't use Unicode (and hence UTF-8). It uses something called Zawgyi that apparently has to be heuristically detected!
Facebook is despicable and indefensible. They knew that they could not moderate Myanmar. They knew or should have known that it was a volatile political situation. The amount of money involved could not have been more than a few million dollars. They should have just turned everything off and said we'll come back when we can. It's disgusting what they did and they should never be forgiven for putting market position ahead of human lives.
TTF is also nearly universally supported. Working with text correctly is just really hard, so I think people like to reach for stuff that already does it for them.