Hacker News

This pops up every so often, and is wrong on several fronts (UNIX is UTF-8, UTF-8/32 lexicographically sort, etc.) There's not really a good reason to support UTF-8 over UTF-16; you can quibble over byte order (just pick one) and you can try and make an argument about everything being markup (it's not), but the fact is that UTF-16 is a more efficient encoding for the languages a plurality of people use natively.

But more broadly, being able to assume $encoding everywhere is unrealistic. Write your programs/whatevers allowing your users to be aware of and configure encodings. It might not be ideal, but such is life.



> There's not really a good reason to support UTF-8 over UTF-16

Two big reasons:

1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to support i18n doesn't require you to convert all your files that were in ASCII.

2. UTF-16 gives people the mistaken impression that characters are fixed-width instead of variable-width, and this causes things to break horribly on non-BMP data. I've seen amusing examples of this.
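
Both points can be seen in a few lines of Python (a sketch; the emoji is just an arbitrary non-BMP code point):

```python
# 1. Any legal ASCII byte string is already valid UTF-8: same bytes, same text.
ascii_bytes = b"plain old ASCII"
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# 2. UTF-16 is variable-width too: non-BMP characters such as emoji take
#    two 16-bit code units (a surrogate pair), so "one code unit == one
#    character" code breaks on them.
s = "a\U0001F4A9b"                       # 'a', PILE OF POO (U+1F4A9), 'b'
assert len(s) == 3                       # 3 code points
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 4                  # 4 UTF-16 code units, not 3
```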

> Write your programs/whatevers allowing your users to be aware of and configure encodings.

Internally, your program should be using UTF-8 (or UTF-16 if you have to for legacy reasons), and you should convert from non-Unicode charsets as soon as possible. But if you're emitting stuff... you should try hard to make sure that UTF-8 is the only output charset you have to support. Letting people select non-UTF-8 charsets for output adds lots of complication (now you have to have error paths for characters that can't be emitted), and you need to have strong justification for why your code needs that complication.
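
To illustrate the error-path point, a small Python sketch (the sample string is arbitrary): emitting UTF-8 always succeeds, while emitting a legacy charset needs a failure branch for unencodable characters.

```python
text = "Œuvre touchée"   # Œ (U+0152) has no representation in Latin-1

# UTF-8 output never needs an error path: every Unicode string is encodable.
utf8_out = text.encode("utf-8")

# A legacy output charset forces you to handle unencodable characters.
try:
    text.encode("iso-8859-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
assert latin1_failed
```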


Every program that purports to support Unicode should be tested with a bunch of emoticons.


Do you mean emoji? I don't see what the issue would be with [{}:();P\[\],.<>/~-_+=XD]


Yes, that's what I meant. I knew I was using the wrong word but couldn't remember the right one.


> 1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to support i18n doesn't require you to convert all your files that were in ASCII.

Eh, realistically if you're doing this, you should be validating it like converting from one encoding to another anyway. I get that people won't and haven't, but that's because UTF-8 has this anti-feature where ASCII is compatible with it, and that's led to a lot of problems.

> 2. UTF-16 gives people the mistaken impression that characters are fixed-width instead of variable-width, and this causes things to break horribly on non-BMP data. I've seen amusing examples of this.

This is one of those problems, and it's way worse with UTF-8 because it encodes ASCII the same way ASCII does. It's let programmers stay naive about this stuff for... decades?

> Internally, your program should be using UTF-8 (or UTF-16 if you have to for legacy reasons), and you should convert from non-Unicode charsets as soon as possible.

There are all kinds of reasons to not use UTF-8. tialaramex pointed out one above. "UTF-8 everywhere" is simply unrealistic, and it forces a lot of applications to be slower, or to take on unnecessary complexity. Maybe it's worth it to "never have to think about encodings again", but that's pretty hard to verify and there's no way it happens in our lifetimes anyway.

> and you need to have strong justification for why your code needs that complication.

Yeah see, I strongly disagree with this. I'll choose whatever encoding I like, thanks. Maybe you don't mean to be super prescriptive here, but I think a little more consideration by UTF-8 advocates wouldn't hurt.


> I'll choose whatever encoding I like, thanks.

If everyone chooses whatever encoding they like, then the charset being used has to be encoded somewhere. The problem is, there are lots of places where charset isn't encoded (such as your filesystem). That this is a problem can be missed, because almost all charsets are a strict superset of ASCII (in the top 99.99% of usage, UTF-{7,16} are the only ones that aren't), so it's only when you try your first non-ASCII characters that problems emerge.
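
The failure mode in two lines of Python: the same unlabeled bytes "successfully" decode under more than one charset, they just produce different text.

```python
raw = "café".encode("utf-8")      # b'caf\xc3\xa9' on disk, with no charset label
assert raw.decode("utf-8") == "café"
assert raw.decode("latin-1") == "cafÃ©"   # classic mojibake: both decodes "succeed"
```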

Unicode has its share of issues, but at this point, Unicode is the standard for dealing with text, and all i18n-aware code is going to be built on Unicode internally. The only safe way to handle text that has even the remotest chance of being i18n-aware is to work with charsets that support all of Unicode, and given its compatibility with ASCII, UTF-8 is the most reasonable one to pick.

If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.


> If everyone chooses whatever encoding they like, then the charset being used has to be encoded somewhere.

This is gonna be the case for the foreseeable future, as you point out. Settling on one encoding only fixes this like, 100 years from now. I'd prefer to build encoding-aware software that solves this problem now.

> given its compatibility with ASCII, UTF-8 is the most reasonable one to pick

This only makes sense if your system is ASCII in the first place, and if you can't build encoding-aware software. I think we can both agree that's essentially legacy ASCII software, so you don't get to choose anything anyway. And any system that interacts with it should be encoding-aware and still validate the encoding anyway, as though it might be BIG5 or whatever. Assuming ASCII/UTF-8 is a bad idea, always and forever.

> If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.

I'm not obligated to write software for every possible user at every point in time. It's perfectly acceptable for me to say, "I'm writing this program for my 1 friend who speaks Spanish" and have that be my requirements. But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there. I'd have to build it to be encoding-aware, and let my users configure the encoding(s) it uses.


> But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there.

Actually, it does.

Right now, in 2020, if you're writing a new programming language, you can insist that the input files must be valid UTF-8 or it's a compiler error. If you're writing a localization tool, you can insist that the localization files be valid UTF-8 or it's an error. Even if you're writing a compiler for an existing language (e.g., C), it would not be unreasonable to say that the source file must be valid UTF-8 or it's an error--and let those not using UTF-8 right now handle it by converting their source code to use UTF-8. And this has been the case for a decade or so.
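
A strict front end of that kind is only a few lines. A Python sketch (the function name and error message are made up here):

```python
def require_utf8(path: str) -> str:
    """Read a source file, rejecting anything that is not valid UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        return data.decode("utf-8")       # strict decoding by default
    except UnicodeDecodeError as err:
        raise SystemExit(f"{path}: invalid UTF-8 at byte offset {err.start}")
```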

That's the point of UTF-8 everywhere: if you don't have legacy concerns [someone actively using a non-ASCII, non-UTF-8 charset that you have to support], force UTF-8 and be done with it. And if you do have legacy concerns, try to push people to using UTF-8 anyways (e.g., default to UTF-8).


I can't insist that other systems send my program UTF-8, or that the users' OS use UTF-8 for filenames and file contents, or that data in databases uses UTF-8, or that the UTF-8 I might get is always valid. The end result of all these things you're raising is "you can't assume, you have to check always, UTF-8 everywhere buys you nothing". Even if we did somehow get there, you'd still have to validate it.


> not really a good reason to support UTF-8 over UTF-16

Of course there is: if you're dealing only with ASCII characters, it's backwards-compatible. Which is a nice convenience in a great number of situations programmers encounter.

The minor efficiency differences between encodings these days aren't particularly relevant -- sure, UTF-16 is better for Chinese, but the average webpage usually has way more markup, CSS, and JavaScript than text, and gzip-ing it on delivery will result in a similar payload regardless of the encoding you choose.
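
A rough Python illustration of that claim (the sample text and repetition count are arbitrary, and exact byte counts will vary):

```python
import gzip

page = '<p class="post">你好，世界。这是一段测试文本。</p>\n' * 200
utf8_raw = page.encode("utf-8")
utf16_raw = page.encode("utf-16-le")

# Raw sizes differ (UTF-16 wins on the Chinese text, UTF-8 on the markup)...
g8 = len(gzip.compress(utf8_raw))
g16 = len(gzip.compress(utf16_raw))
# ...but after gzip the two payloads land in the same ballpark.
```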


UTF-8's ASCII compatibility is an anti-feature; it's allowed us to continue to use systems that are encoding naive (in practice ASCII-only). It's no substitute for creating encoding-aware programs, libraries, and systems.

The vast majority of text is not in HTML or XML, and there's no reason you can't use Chinese characters in JavaScript besides (your strings and variable/class/component/file names will surely outpace your use of keywords).


It's not an anti-feature, it's a benefit that is a huge asset in the real world. For example, you can be on a legacy ASCII system, inspect a modern UTF-8 file, and if it's in a Latin language then it will still be readable as opposed to gibberish. Yes all modern tools should be (and these days generally are) encoding-aware, but in the real world we're stuck with a lot of legacy tools too.

And of course the vast majority of transmitted digital text is in HTML and similar! What do you think it's in instead?

By sheer quantity of digital words consumed by the average person, it's news and social media delivered in browsers (HTML), followed by apps (still using HTML markup to a huge degree) and ebooks (ePub based on HTML). And of course plenty of JSON and XML wrapping too.

And of course you can use Chinese characters in JavaScript/JSON, but development teams are increasingly international and English is the de facto lingua franca.


That huge asset has become a liability. We always needed to become encoding-aware, but UTF-8's ASCII compatibility has let us delay it for decades, and caused exactly the confusion causing us to debate right now. So many engineers have been foiled by putting off learning about encodings. Joel Spolsky wrote an article, Atwood wrote an article, Python made a backwards incompatible change, etc. etc. etc.

To be honest, I'm just guessing about what text is stored in--I'll cop to it being very hard to prove. But my guess is the vast majority of text is in old binary formats, executables, log files, firmware, or in databases without markup. That's pretty much all your webpages right there.

n.b. JSON doesn't really fit the markup argument. The whole idea is that HTML is super noisy and the noise is 1 byte in UTF-8, and 2 bytes in UTF-16. JSON isn't noisy so the overhead is very low.
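
The arithmetic behind the markup argument, as a Python sketch with a toy snippet of pure-ASCII markup:

```python
markup = '<div class="comment"><p></p></div>'   # the "noise" is all ASCII
assert len(markup.encode("utf-8")) == len(markup)            # 1 byte per character
assert len(markup.encode("utf-16-le")) == 2 * len(markup)    # 2 bytes per character
```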


I just don't know what you're talking about.

You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place. What exactly are we delaying for decades? Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.

And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day, dwarfing executables, firmware, etc. And if it supports any kind of formatting (bold/italics etc.) -- which most does -- then it's virtually always stored in HTML or similar (XML). I mean, what are even the alternatives? Neither RTF nor Markdown come even close in terms of adoption.


> You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place.

Totally agree.

> What exactly are we delaying for decades?

Learning how encodings work and using that knowledge to write encoding-aware software.

> Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.

They do, but they're frequently foiled by on-disk encodings, filenames, internal string formats, network data, etc. etc. etc. All this stuff is outlined in TFA.

> And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day

I concede I'm not likely to convince you here, but like, do you think Twitter is storing markup in their persistence layer? I doubt it. And even if there is some formatting, we're talking about <b> here, not huge amounts of angle brackets.

But think about any car display. That's probably not markup. Think about ATMs. Log files. Bank records. Court records. Label makers. Airport signage. Road signage. University presses.


The reason most programmers use English in their source code has nothing to do with file size (for that there are JS minifiers) or supported encodings. It comes down to two things: English is the most used language in the industry, so if you want to cooperate with programmers from other parts of the world English is a good idea; and it frankly looks ugly to mix languages in the same file, so when the standard library is in English your source code will be too.

So since most source code is in English (and JS is minified anyway), UTF-8 works perfectly there too.


I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if just for the little/big endian mess alone, and that UTF-16 isn't a fixed-length encoding either).

From that perspective, keeping the data in UTF-8 for most of its lifetime also when loaded into a program, and only convert "at the last minute" when talking to underlying operating system APIs makes a lot of sense, except for some very specific application types which do heavy text processing.


I'm gonna do little quotes, but I don't mean to be passive-aggressive. It's just that this stuff comes up all the time.

> I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if just for the little/big endian mess alone...

This should be the responsibility of a string library internally, and if you're saving data to disk or sending it over the network, you should be serializing to a specific format. That format can be UTF-8, or it can be whatever, depending on your application's needs.

> and that UTF-16 isn't a fixed-length encoding either)

We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.
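
It goes deeper than encodings: even counting code points doesn't give you "characters", so fixed-width assumptions fail at every level. A quick Python sketch:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # é as 'e' + COMBINING ACUTE ACCENT
assert len(precomposed) == 1
assert len(combining) == 2    # two code points, one user-perceived character
assert unicodedata.normalize("NFC", combining) == precomposed
```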

> keeping the data in UTF-8 for most of its lifetime also when loaded into a program, and only convert "at the last minute" when talking to underlying operating system APIs makes a lot of sense, except for some very specific application types which do heavy text processing.

Well, you're essentially saying "I know about your use case better than you do". It might be important to me to not blow space on UTF-8. But if my platform/libraries have bought into "UTF-8 everywhere" and don't give me knobs to configure the encoding, I have no recourse.

And that's the entire basis for this. It's "having to mess with encodings is worse than the application-specific benefits of being able to choose an encoding". I think that's... at best an impossible claim and at worst pretty arrogant. Again here I don't mean you, but this "UTF-8 everywhere" thing.


>We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.

Mistaking a variable-width encoding for a fixed-width one is specifically a UTF-16 problem. UTF-8 is so obviously not fixed-width that such an error could not happen by mistake, because even before widespread use of emojis, multibyte sequences were not in any way a corner case for UTF-8 text (for additional reference, compare UTF-16 String APIs in Java/JavaScript/etc. with UTF-8 ones in, say, Rust and Go, and see which ones allow you to easily split a string where you shouldn't be able to, or access "half-chars" as a datatype called "char").


I mean, I think we're both in the realm of [citation needed] here. I would argue that people index into strings quite a lot--whether that's because we thought UCS-2 would be enough for anybody or UTF-8 == ASCII and "it's probably fine" is academic. The solution is the same though: don't index into strings, don't assume an encoding until you've validated. That makes any "advantage" UTF-8 has disappear.

If you really think no one made this mistake with UTF-8, just read up on Python 3.


The difference is that with UTF-8 you're much more likely to trip over those bugs in random testing. With UTF-16 you're likely to pass all your test cases if you didn't think to include a non-BMP character somewhere. Then someone feeds you an emoji character and you blow up.
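
A Python sketch of the "fails fast" point: slice UTF-8 bytes in the wrong spot and even plain Latin text with one accent blows up immediately, long before an emoji shows up.

```python
b = "naïve".encode("utf-8")    # b'na\xc3\xafve'; the ï is two bytes
try:
    b[:3].decode("utf-8")      # cuts the ï sequence in half
    truncation_detected = False
except UnicodeDecodeError:
    truncation_detected = True
assert truncation_detected
```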


Which is why you should be using a library for all this, that uses fuzzing and other robustness checks.


> We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.

So what do you suggest? UTF-16 and UTF-32 encourage this even more.


Yeah, ASCII is such a powerful mental model that I think anyone working with Unicode made a lot of concessions to convert people, no argument there. But I think we need to say we're done with that and move on to phase 2. Here's what I advocate:

- Encodings should be configurable. Programmers get to decide what format their strings are internally, users get to decide what encoding programs use when dealing with filenames or saving data to disk, etc. Defaults matter, and we should employ smarts, but we should never say "I know best" and remove those knobs.

- Engineers need to internalize that "strings" conceal mountains of complexity (because written language is complex), and default to using libraries to manage them. We should start viewing manual string manipulation as an anti-pattern. There isn't an encoding out there that we can all standardize on that makes this untrue, again because written language is complex.


But is it really a plurality? Portuguese, English, Spanish, Turkish, Vietnamese, French, Indonesian and German are stored more efficiently in UTF-8 while Chinese, Korean and Japanese are stored less efficiently. My gut feel is that more people use the Latin script than people using CJK scripts. Indic scripts, Thai, Cyrillic, etc are stored using two bytes in both UTF-8 AND UTF-16.

And this ignores markup, which is in ASCII.


Looking at the basic multilingual plane [1], UTF-8 will use > 2 bytes to encode essentially anything that isn't:

* ASCII/Latin

* Cyrillic

* Greek

* Most of Arabic

That leaves out:

* China

* India

* Japan

* Korea

* All of Southeast Asia

Re: markup, think about any text that's in a database, stored in RAM, or stored on a disk--relatively little of it will be in noisy ASCII markup formats like HTML or XML.

[1]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilin...
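
To put numbers on it, encoded sizes in bytes for one representative code point per script (a Python sketch):

```python
samples = {
    "Latin e":      "e",       # U+0065: 1 byte in UTF-8, 2 in UTF-16
    "Cyrillic д":   "\u0434",  # U+0434: 2 bytes in UTF-8, 2 in UTF-16
    "Devanagari क": "\u0915",  # U+0915: 3 bytes in UTF-8, 2 in UTF-16
    "Chinese 中":   "\u4e2d",  # U+4E2D: 3 bytes in UTF-8, 2 in UTF-16
}
sizes = {name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
         for name, ch in samples.items()}
```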


> All of Southeast Asia

Did you forget Indonesia, Vietnam, Malaysia, Brunei and the Philippines?


Again, here's what UTF-8 will use <= 2 bytes for:

Basic Latin (Lower half of ISO/IEC 8859-1: ISO/IEC 646:1991-IRV aka ASCII) (0000–007F)

Latin-1 Supplement (Upper half of ISO/IEC 8859-1) (0080–00FF)

Latin Extended-A (0100–017F)

Latin Extended-B (0180–024F)

IPA Extensions (0250–02AF)

Spacing Modifier Letters (02B0–02FF)

Combining Diacritical Marks (0300–036F)

Greek and Coptic (0370–03FF)

Cyrillic (0400–04FF)

Cyrillic Supplement (0500–052F)

Armenian (0530–058F)

Aramaic Scripts:

    Hebrew (0590–05FF)

    Arabic (0600–06FF)

    Syriac (0700–074F)

    Arabic Supplement (0750–077F)

    Thaana (0780–07BF)

    N'Ko (07C0–07FF)

In UTF-8, everything over U+0800 requires > 2 bytes. Am I misunderstanding something? It's possible.



