I think we should take a moment to appreciate how great UTF-8 is, and how well it worked out. It's easy to get disillusioned with internet standards when IPv6 is taking forever and messaging is all proprietary locked-down protocols. Yet character encodings used to be a horrible mess, and now they're not. In the 90s the only practical solution was for everyone to use the same OS, same word processor, same web browser, and who cares about talking with foreigners anyway?
I don't think it was always guaranteed to turn out well. China and Japan could have stayed with their own encodings. Microsoft and Apple could have done incompatible things. The tech world is full of bad things we're stuck with because there's no way to coordinate a change.
Unicode has its flaws, UTF-16 is still lurking here and there, everyone loves to argue about emoji, but overall text just works now.
One little feature I like in particular is that if you're looking for an ASCII-7 character in a UTF-8 stream -- say, an LF or comma -- you don't have to decode the stream first, because all bytes in the encoding of non-ASCII-7 characters have the high bit set. Or as Wikipedia puts it:
> Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.
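That property is easy to check for yourself. A small Python sketch (my own illustration, not from the thread): search raw UTF-8 bytes for a comma without decoding, and confirm that no byte of a multi-byte sequence could ever be mistaken for it.

```python
# Every byte of a multi-byte UTF-8 sequence is in the range 0x80-0xFF,
# so a raw byte search for an ASCII character can never hit a false
# positive inside a multi-byte sequence.
data = "naïve, café, 渋谷".encode("utf-8")

# Find every comma by scanning raw bytes -- no decoding needed.
comma_positions = [i for i, b in enumerate(data) if b == ord(",")]
print(comma_positions)  # byte offsets of the two commas

# Sanity check: no byte belonging to a non-ASCII character is below 0x80.
multibyte_bytes = "ïé渋谷".encode("utf-8")
assert all(b >= 0x80 for b in multibyte_bytes)
```

This is exactly why tools like `grep`, CSV splitters, and path parsers can treat UTF-8 as opaque bytes and still be correct.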
It's amazing to hear they put it together in one night at a diner! :-D
> It's amazing to hear they put it together in one night at a diner! :-D
I guess you're saying that in good humor. But I'll add this because it makes me appreciate how these things happen:
> What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it.
"We hated it" -- there is just so much going on in those 3 words. They could have been suffering with the previous state for a year for all we know. And even if not, to know you hate something just takes a lot of system building experience to get to. And then when opportunity struck they probably already had a laundry list of grievances they had built up over that time and were ready to pounce.
If they hadn't had on-the-ground experience with the Plan 9 version, and been able to see from that actual experience which parts of it they wanted to keep and which parts needed to be done differently...
Often you can't build the polished thing until you have experienced the thing before.
Lately I get discouraged that there seems to be so little attention paid to "prior art" in software development; building on it is the only way to make progress!
While the design is nice, it doesn't seem -that- earthshattering that it was done in four days. Once you make the realization that 'wait, ASCII only needs the lower 7 bits, let's work off that', it's all just details past that.
Don't get me wrong, I love UTF-8 and it is well thought out and designed. But the end result is not so complicated, so much so that pretty much anyone reading the rules could understand it.
I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems. Today's 'amazing' things would involve image recognition or processing, self driving cars, better ML/AI algos. Things that are hard to impossible to be done by a guy or two over the weekend.
Sadly, as a result, I think we'll have fewer 'programming heroes' than existed in previous decades.
> While the design is nice, it doesn't seem -that- earthshattering that it was done in four days.
And yet it may have needed a genius to design and write something so simple. UTF-8 was not the first multilingual encoding system; here's an entire list of them, worked on by a lot of probably very smart people:
Edit: A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. — Antoine de Saint-Exupéry
>I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems.
git was 2005, and that was probably similarly impactful in the version control space (in that it was much closer to fundamentally correct, than its predecessors). And there are quite a few standards out there that only survive by virtue of already having been established -- not because they meet any reasonable bar of quality. IPv4 (and all the grand schemes to work around the terror of NAT), email (the worst communication system, except for all the others), SQL (the language specifically -- a mishmash of keywords with almost no ability to properly compose), etc.
The bigger difference I think between the 90's and now is that it was probably much easier to make your new superior standard actually be used -- you could implement a new kernel today which was fantastically superior to linux, and you're much more likely than not to get zero traction (ex: plan9) simply by virtue of how well-entrenched linux already is.
Given that Torvalds apparently went from design to implementation in 3 days, and 2 months later had it officially managing the kernel, I wouldn’t say it was particularly high-hanging.
Yeah, this is great! I came across that recently when working on a parser in Zig, which treats strings as arrays of bytes. I didn't know much about UTF-8 other than that it's scary and programmers mess up text processing all the time. I was worried that a multi-byte code point could trick my simple char switch, which was looking for certain ASCII characters. But then I came across that bit you quoted and was both surprised and relieved!
Then, when I needed to minimally handle non-ASCII characters I found Zig's minimal unicode helper library and saw what I was looking for in a small function that takes a leading byte and returns how many bytes there are in the codepoint. I was impressed with the spec again!
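The trick that helper relies on is that the leading byte's top bits encode the sequence length. A rough Python equivalent (my own sketch, not the actual Zig code):

```python
def utf8_sequence_length(leading_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with this byte has."""
    if leading_byte < 0x80:          # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if leading_byte & 0xE0 == 0xC0:  # 110xxxxx: 2-byte sequence
        return 2
    if leading_byte & 0xF0 == 0xE0:  # 1110xxxx: 3-byte sequence
        return 3
    if leading_byte & 0xF8 == 0xF0:  # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a leading byte (continuation byte or invalid)")

assert utf8_sequence_length("a".encode("utf-8")[0]) == 1
assert utf8_sequence_length("é".encode("utf-8")[0]) == 2
assert utf8_sequence_length("渋".encode("utf-8")[0]) == 3
assert utf8_sequence_length("😀".encode("utf-8")[0]) == 4
```

So you can skip over whole codepoints without ever looking at the continuation bytes, which is what makes simple byte-level parsers safe.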
I wonder how many pieces of computing technology used today were put together in a single evening by a team of motivated developers. Rubygems, for example, was written in a couple of hours at the back of a hotel bar, then demoed (complete with network install and versioning) at Rubyconf the following morning.
As I age, I'm starting to believe that the best technology is often built this way, rather than stewing for years in an ISO subcommittee. Limited development time can lead to features that provide the greatest value for the time spent.
> It's amazing to hear they put it together in one night at a diner! :-D
I will bet that he had half-formed ideas of how it could work from the previous pain with the "original UTF". The best people I work with are constantly looking at things that are wrong and coming up with ideas for how they could be better, even if 99% of them will never be used.
> China and Japan could have stayed with their own encodings.
Absolutely correct. There was a big debate in Japan in the 1990s about character encodings, with some people arguing strongly against the adoption of Unicode. Their main argument, as I remember it, was that Unicode didn’t capture all of the variations in kanji, especially for personal names.
For those of us who were trying to use Japanese online at the time, though, those arguments seemed beside the point. While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe [1], we were faced with the daily frustration of trying to convert between JIS, S-JIS, EUC, and other encodings and often not being able to exchange Japanese text at all with people who hadn’t installed special software on their computers. It was a great relief when UTF-8 became adopted universally.
Tell that to my coworkers. I still get emails encoded in SJIS every day, sometimes with attachments with the file name also encoded in SJIS, which results in funny mojibake when saving them to disk. Not to mention the many web forms that insist you need to write your name in full-width characters or whatever funky shit.
On the other hand, I recently got some Python scripts to crash because someone on the European team decided to encode some texts in ISO-8859-1, and Python assumes everything is UTF-8.
I really, really wish one day all legacy encodings will disappear from the face of the Earth and only UTF-8 will stay.
Not to mention that the Linux unzip utility doesn't have a way to handle Shift-JIS filenames, or really any filename encodings besides UTF-8. You have to use an entirely different program like unzip-jp just for those files, in order to not be left with dozens of unintelligible folder names.
There's a reason the underground community calls it "shit-jizz."
On that issue, infsp6 (the Spanish library for Inform 6, akin to the English inform6lib) still uses ISO-8859-15, and it's a pain in the ass to convert the encoding to and from UTF-8 unless you use emacs, joe, or vim to edit the source code (I use nvi).
Thunderbird will display SJIS emails just fine. The problem with attachments is when some adds a ZIP with SJIS filenames, but then it's not Thunderbird's problem but whatever tool you use to decompress it.
Regarding Python, the default behaviour when decoding an invalid UTF-8 string is to raise an exception. But your comment made me research it, and I just found that there is a way to replace invalid bytes with U+FFFD, so I will try it.
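For anyone else hitting the same crash, here's a small sketch of both behaviours, using ISO-8859-1 bytes as the "bad" input:

```python
# ISO-8859-1 bytes for "café" -- é is the single byte 0xE9, which is
# not valid on its own in UTF-8.
data = "café".encode("iso-8859-1")

# Default: strict decoding raises UnicodeDecodeError.
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# Alternative: replace invalid bytes with U+FFFD (the replacement character).
lenient = data.decode("utf-8", errors="replace")

assert raised
assert lenient == "caf\ufffd"
```

There's also `errors="ignore"` to drop the bad bytes silently, but the U+FFFD marker at least leaves evidence that something was mangled.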
Han unification was definitely a mistake. To this day people in different countries will use different fonts so that text looks how it is supposed to in their language.
The promise of unicode was that you can losslessly convert any encoding to unicode. However, because of the failed attempt at Han unification, some important information can be lost.
Exactly, UTF-8 is great, except this part. In the name of unification you destroyed the culture that was there. But modern days people are fine with it as long as it is not their culture. And no, adding fonts or notation doesn't solve the problem. I remember I read a very nicely put analogy on HN years ago.
The link describes the problems but _not_ having a unified CJK table results in other problems. I regularly read texts in both Simplified and Traditional Chinese and some Japanese, and it's nice that at least characters that are uncontroversially identical have the same code point, so that what I type in my preferred input method could be used interchangeably (think eg. text searching).
There's enough trouble already with having a "handful" of commonly seen character variants between SC, TC and JP, making things like copyediting text a pain if you're not well versed in unicode shenanigans. I am involved in maintaining a Cantonese dictionary, and inadvertent duplicate entries occur very often because of these variants (yes, people sometimes, somehow, manage to input characters outside of the supposed locale -- and never notice because they look very similar).
Fundamentally the problem is that European notions of "character" break down in East Asian languages. Not sure whether this is anyone's fault, and if we designed it again today with hindsight and improved technology (font rendering, text searching/processing, etc.) we might have decided to construct Han "characters" from radicals instead. But in the 90s there might not have been a better choice.
PS: not to mention the political fallout with having a separate table for Mainland-Chinese and Taiwan-Chinese if that were a policy. It's like having an English "a" and an American "a". Should Wales, Scotland and Northern Ireland have a different 'a' then? Why shouldn't the 50 states of USA each have a different 'a'. It'd bring the politics of the United Nations into Unicode.
Han unification isn't the main blocker to the transition to UTF-8 in Japan. Just use a Japanese font in that context.
The reason SJIS is still used is that there are many legacy systems, and developers who still think SJIS is fine. We tend not to handle other languages, so it mostly works (emoji aside).
SJIS is sometimes useful because a 1-byte char is half-width and a 2-byte char is full-width by design. Old developers still refer to Japanese characters as "2-byte characters" even when the system is UTF-8.
Han unification is a bullshit excuse. Is two story 'a' a different letter than one story? Is seven with a slash through it different than seven without? Is Japanese as written in pre-war books a different language than Japanese in post-war books?
Unicode may have dropped a couple of variants, but they basically all got added back. There's no problem with Han unification; there's just a FUD campaign powered by nationalism and ignorance that is used to justify everyday technological inertia.
> While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe [1],...
Unicode has gotten so big, isn’t this included by now?
Also see the IVD [1]. Indeed both 邉 (U+9089) and 邊 (U+908A) are exceptionally variable characters, the first having 32 variation sequences (the record as of 2020-11-06) and the second having 21 variation sequences.
Either hiragana or katakana is used on most official documents for convenience. On the family registers (戸籍 koseki), which are perhaps the most important, though, the readings of names are not listed. For people whose names are written only in kanji, those kanji, and not the readings, are the legal versions of their names.
As someone currently stuck in the windows world, this hurts. Every single Windows API is still stuck with using UTF-16/UCS2 as the string encoding.
Also fun fact, on the Nintendo Switch, various subsystems use different kinds of encoding. The filesystem submodule uses Shift-JIS, most of the other modules use UTF-8, but some others yet use UTF-16 (like the virtual keyboard, IIRC). A brilliant mess.
Conversion functions - MultiByteToWideChar & co. - were in since Windows 2000 and the UTF8 codepage was supported as early as XP if not in W2K as well.
It existed in W2K and maybe even earlier, but there were bugs in the console regarding codepage 65001, so you couldn't use it as the default. This was not fixed yet in XP, maybe in 7 though.
Ah thanks! It's funny because I can't recall ever using code page 65001 before 7. Maybe there was a reason for that or maybe I simply didn't know it existed until then. Or maybe I thought it simpler to just use UTF-16. I can't remember.
Windows does allow setting it in the application's manifest but it also requires a registry setting to be enabled otherwise the manifest option is ignored. Obviously asking users to edit the registry is a non-starter so it's only used where the developers also control the user environment (e.g. the registry change is deployed through group policies, etc).
But yeah, this just tells Windows to do the conversion so that programmers don't have to type out the function calls themselves. It's simple enough to create a wrapper function for Windows API calls in any case.
Java is still using UTF-16, it is the internal format used since its creation. I don't know exactly how much this is a problem or not, but it shows that UTF-16 is still an important thing.
I think it's a huge problem for Java. Try doing proper string collation (standard library or ICU4J), or regular expression matching, in a context where your strings are all UTF-8 and your output should also be UTF-8. Operations that shouldn't require allocation do, because you have to transcode to UTF-16. Not to mention that in some cases, that transcoding is the most expensive part of the operation.
All the core Java APIs are built around String or CharSequence (more the latter in releases post-Java 8). CharSequence is a terrible interface for supporting UTF-8 or any encoding besides latin1 or UTF-16. If Java's interfaces had been designed around Unicode codepoint iteration rather than char random access, then the coupling to UTF-16 wouldn't have been so tight. But as things stand, you aren't doing anything interesting to text in Java without either (1) re-implementing everything from scratch, from integer parsing to regexp, or (2) paying the transcode cost on everything your program consumes and emits.
Personally I don't find UTF-16 to be too bad. It's a simple encoding and very easy to convert to/from UTF-8. So your program can be written in UTF-8 and your WinAPI wrappers can convert as/when needed.
Which is not UTF-16 at all; the UTF-16 standard clearly says this is not allowed. So why do they do that?
It's actually a leftover of the earlier UCS-2 standard, before it was realized we'd need more codepoints than that, and that it was a mistake to limit to 16-bit space for codepoints in any encoding.
Software written for UCS-2 can mostly work compatibly with UTF-16, but there are some problems, encoding the 'higher' codepoints is only one of several. Another is how right-to-left scripts are handled.
Wasn't UTF-16 explicitly created as a "backward compatibility hack" for UCS-2 when it became clear that 16 bits per code point isn't enough? They should have ditched 16-bit encodings back then instead of combining the disadvantages of UTF-8 (variable-length encoding) and UTF-32 (not endian-agnostic).
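For context, the hack works by reserving two 16-bit ranges (the surrogates) that pair up to encode codepoints above U+FFFF. A quick sketch of the arithmetic (in Python, cross-checked against its own UTF-16 encoder):

```python
import struct

def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a codepoint above U+FFFF as a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    v = cp - 0x10000                # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)       # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (v & 0x3FF)      # low (trail) surrogate: bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)  # 😀
assert (high, low) == (0xD83D, 0xDE00)

# Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
assert "😀".encode("utf-16-be") == struct.pack(">HH", high, low)
```

So a UCS-2 program can shuttle the pair around as two "characters" without understanding them, which is exactly the compatibility path being discussed, and exactly why string lengths and indexing get subtly wrong.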
Perhaps unicode wouldn't be nearly as successfully adopted as it is, if they had left UCS-2 adopters hanging instead of providing them a "backward compatibility hack" path.
The UCS-2 adopters after all had been faithfully trying to implement the standard at that time. Among other things, showing implementers that if they choose to adopt, you aren't going to leave them hanging out to dry when you realize you made a mistake in the standard, will give other people more confidence to adopt.
But also, just generally I think a lesson of unicode's success -- as illustrated by UTF-8 in particular -- is, you have to give people a feasible path from where they are to adoption, this is a legitimate part of the design goals of a standard.
Unicode often gets a lot of online hate, which frustrates me, as I agree with you -- Unicode in general is a remarkably successful standard, technically as well as with regard to adoption.
Its adoption success isn't a coincidence; it's a result of choices made in the design -- with UTF-8 being a big part of that. The choices sometimes involve trade-offs, which lead to the things people complain about (say, the two different codepoint arrangements that can represent é -- there's a reason for that, again related to easing the on-ramp to Unicode from legacy technologies, one of the main goals of UTF-8).
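Concretely, the two arrangements for é are a precomposed codepoint (U+00E9, which round-trips cleanly from Latin-1) and a base letter plus combining accent (U+0065 U+0301). A small illustration with the standard library:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single codepoint (NFC form)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (NFD form)

# They render identically but compare unequal as raw strings...
assert precomposed != decomposed

# ...until normalized to a common form.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Normalizing before comparison is the standard answer to this class of complaint.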
There are always trade-offs, nothing is perfect. But Unicode sometimes seems to me to be almost the optimal balance of all the different concerns, I think they could hardly have done better!
The "UCS-2 => UTF-16" mis-step was unfortunate, and we are still dealing with some of the consequences (Java/Windows)... but the fact that we made it through with Unicode adoption only continuing to grow is a testament to Unicode's good design again.
It's not until I ran into some of the "backwaters" of Unicode, realizing they had thought out and specified how to do things like "case-insensitive" normalized collation/comparison for a variety of different specifications in a localized and reasonably performant way...
Hate against Unicode frustrates me too. It's like people are privileged and don't know what it's like to have at least 2 code pages for your language. I _still_ have to fix files from those encodings and back into Unicode. Unicode is a blessing even if many don't see it.
It's the same with html and css: people shit on it all the time, but this just shows they don't have the imagination to see how much worse it could be.
Sort of related: I learned from reading about Facebook's lack of moderation that Myanmar is one of the few countries that doesn't use Unicode (and hence UTF-8). It uses something called Zawgyi that apparently has to be heuristically detected!
Facebook is despicable and indefensible. They knew that they could not moderate Myanmar. They knew or should have known that it was a volatile political situation. The amount of money involved could not have been more than a few million dollars. They should have just turned everything off and said we'll come back when we can. It's disgusting what they did and they should never be forgiven for putting market position ahead of human lives.
TTF is also nearly universally supported. Working with text correctly is just really hard, so I think people like to reach for stuff that already does it for them.