
Occasionally, I have an irresistible urge to strangle everyone who uses Unicode and UTF-8 (or other UTF encodings) interchangeably.

UNICODE is a good thing because it provides a codepoint for every character that we care about, instead of having a 256-character subset for every group of languages and needing complicated software to puzzle out how to convert from one subset to another. Unicode allows fantastic stuff such as upper/lowercasing text, including all the weird letters that you previously had to special-case.

ASCII used to be a good thing because it allowed people to ship around basic English and Cobol code without any worries, but is actually pretty evil because people from Anglo-Saxon countries assume that every other bit of text is composed of English and Cobol.

Having a notion of ENCODINGS is useful if you occasionally get bits of text that are neither English nor Cobol. You still needed different encodings for different groups of languages, and arcane mechanisms to provide hints on which encoding is meant, at least if you got non-English bits of text. The very notion of an Encoding scares the people who used to think the world consists of English and Cobol.

UTF-8 is a very reasonable encoding that can be used to represent all of Unicode while being Ascii-compatible. Hence it is a sane choice as a default encoding for people who are scared of having to think about encodings. Because UTF-8 is not the only encoding out there, Unicode-compatible programs accept Unicode text in many other encodings, including those that cannot represent the full range of Unicode and are only a good choice for some people but not others.

tl;dr: non-UTF-8 text can (and should) still be read as unicode codepoints. Ignoring the >=40% of texts out there or saying that they're "not Unicode" doesn't help anybody.
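A small sketch of that point (example encoding and string are mine, not the author's): decoding bytes in any legacy encoding still yields ordinary Unicode code points, the same ones UTF-8 would give you.

```python
# Text in a legacy encoding still decodes to Unicode code points.
data = b"na\xefve"                 # "naïve" encoded as Latin-1, not UTF-8
text = data.decode("latin-1")      # decoding yields Unicode code points
print(text, hex(ord(text[2])))     # naïve 0xef -- U+00EF, same code point
                                   # no matter which encoding carried it
```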



And to a programmer, what matters is whether the encoding is present and correct.

From a security perspective, if you don't include the right encoding metadata, an attacker can include an XSS attack in UTF-7. Your server reads normal (but gibberish) ASCII, escapes it properly, then schleps it to the client, who thinks it's UTF-7 (because of all the UTF-7 characters), and suddenly they are running malicious JavaScript that you didn't escape.

If you're scraping sites, there's one thing more annoying than trying to guess the encoding, and that's dealing with multiple encodings in the same document.

Putting UTF-8 characters in a document doesn't mean you are using "Unicode". It means you are using some undeclared encoding, which is not a step forwards.


Re: UTF-7, is anyone in the world actually using that? I've only ever read about it in articles about security problems.


Certainly not on the web, it’s disallowed in the HTML5 spec:

User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.

and major browsers removed support, e.g.:

https://bugzilla.mozilla.org/show_bug.cgi?id=414064


Not sure if this "counts," but it is still used to encode IMAP folder names. (Technically, the IMAP version is slightly different from standard UTF-7.)
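The IMAP variant (RFC 3501's "modified UTF-7") differs from standard UTF-7 in that '&' is the shift character instead of '+', ',' replaces '/' in the base64 alphabet, and "&-" encodes a literal '&'. A minimal decoder sketch (the function name and example folder name are mine):

```python
import base64

def imap_utf7_decode(name: str) -> str:
    """Decode an IMAP folder name (assumes well-formed modified UTF-7)."""
    out = []
    i = 0
    while i < len(name):
        if name[i] != "&":
            out.append(name[i])               # direct (unshifted) character
            i += 1
            continue
        end = name.index("-", i)              # a shifted run always ends at '-'
        b64 = name[i + 1:end]
        if not b64:                           # "&-" encodes a literal '&'
            out.append("&")
        else:                                 # base64 (',' for '/') of UTF-16BE
            pad = "=" * (-len(b64) % 4)
            raw = base64.b64decode(b64.replace(",", "/") + pad)
            out.append(raw.decode("utf-16-be"))
        i = end + 1
    return "".join(out)

print(imap_utf7_decode("Entw&APw-rfe"))       # Entwürfe (a German "Drafts" folder)
```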


I'm confused by that as well. I never really understood the motivation behind it - though I guess it's obsolescent, so hopefully it'll soon be just another one of those amusing anachronisms we see occasionally in computer science.


As I understand it, the motivation was to be able to transmit Unicode text over a channel that's only 7-bit safe (e.g. mail protocols) without having to do something silly like Base64-encoding the whole thing.
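A quick comparison of the two approaches (the sample text is mine): UTF-7 leaves the ASCII portion readable and only base64-encodes the exceptional characters, while base64-encoding the whole thing obscures everything and inflates it by roughly a third.

```python
import base64

text = "Mostly ASCII, with one umlaut: \u00fc"
as_utf7 = text.encode("utf-7")                  # only the umlaut is shifted
as_b64 = base64.b64encode(text.encode("utf-8")) # the whole string is opaque
print(as_utf7)   # b'Mostly ASCII, with one umlaut: +APw-'
print(as_b64)    # unreadable, and ~33% larger than the UTF-8 bytes
```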


This post makes no sense to me.

Are you saying that we should still use character encodings rather than UTF encodings, or are you saying that we shouldn't assume that raw text is ASCII, or are you saying something else?

Unicode is, essentially, nothing unless it is encoded. When you encode it you must decide whether your code units are one byte, two bytes or four bytes.

Only UTF-8 and UTF-32 are really big enough to hold the world's characters. Everything else is a fudge.

ASCII was never Anglo-Saxon: it was always American. COBOL is a red herring here, too. ASCII was all about teleprinters and was a clever use of 7 bits for its time. Bell Labs helped create ASCII, and Bell Labs invented UTF-8.

UTF-8 is much better than reasonable. It is a compact way to represent Unicode while preventing the Western world from having to re-encode every text document. That's a lot of good news for the internet.
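Both properties are easy to check (examples mine): pure-ASCII text is byte-for-byte identical in UTF-8, and everything else takes two to four bytes per code point.

```python
# ASCII-compatibility: pure-ASCII text has identical bytes in both encodings.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# Variable width: one to four bytes per code point.
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):   # A, é, €, 𝄞
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```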

Are you saying that we shouldn't ignore other charsets, as they are still valid Unicode? If so, I agree up to a point, the point being that there is no longer any need for any Unicode encoding other than UTF-8. If you need to access your local characters as a byte array, choose your internal encoding and translate, do your magic, and then spit out UTF-8 again; then we can all simply read the same documents without the over-complexity.


It makes sense to me. His point is that calling UTF-8 "Unicode" is wrong; that just confuses people. Windows programmers usually also call UTF-16 "Unicode", which is also wrong.

Unicode is not a text encoding, it's a standard that assigns a number to every character. Saying that a particular text is "Unicode" doesn't give any info on how to decode it; we should just say that a given text is UTF-7, UTF-8, UTF-16, UTF-32, etc.
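That distinction is easy to demonstrate (the example string is mine): the same code points serialize to different bytes under each UTF encoding, so "this text is Unicode" alone doesn't tell you which byte sequence to expect.

```python
s = "h\u00e9llo"   # "héllo": five code points, three different byte layouts
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {s.encode(enc)}")
```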


> Only UTF-8 and UTF-32 are really big enough to hold the world's characters

So is GB18030.

Edit: and UTF-16, of course (just don't confuse it with UCS-2)
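Both claims check out in a couple of lines (examples mine): GB18030 round-trips any code point, and UTF-16 escapes the UCS-2 ceiling of U+FFFF by using surrogate pairs for astral characters.

```python
# GB18030 covers all of Unicode: every code point round-trips.
s = "h\u00e9llo \u4f60\u597d \U0001d11e"   # Latin, CJK, and an astral character
assert s.encode("gb18030").decode("gb18030") == s

# UCS-2 stops at U+FFFF; UTF-16 encodes U+1D11E as the surrogate pair D834 DD1E.
print("\U0001d11e".encode("utf-16-be"))
```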


I'm guilty of using UTF-16 when I really mean 16-bit Unicode arrays. I see that now.

But it does raise the question: why would you use UTF-16? Yes, if you have more than 512 characters in your common script then OK, I can see it might make sense, but not much. UTF-8 will still average out in a reasonable way.


Because it's the standard in the Windows API.



