
Occasionally, I have an irresistible urge to strangle everyone who uses Unicode and UTF-8 (or other UTF encodings) interchangeably.

UNICODE is a good thing because it provides a codepoint for every character that we care about, instead of having a 256-character subset for every group of languages and needing complicated software to puzzle out how to convert from one subset to another. Unicode allows fantastic stuff such as upper/lowercasing text, including all the weird letters that you previously had to special-case.

ASCII used to be a good thing because it allowed people to ship around basic English and Cobol code without any worries, but is actually pretty evil because people from Anglo-Saxon countries assume that every other bit of text is composed of English and Cobol.

Having a notion of ENCODINGS is useful if you occasionally get bits of text that are neither English nor Cobol. You still needed different encodings for different groups of languages, and arcane mechanisms to provide hints on which encoding is meant, at least if you got non-English bits of text. The very notion of an Encoding scares the people who used to think the world consists of English and Cobol.

UTF-8 is a very reasonable encoding that can be used to represent all of Unicode while being Ascii-compatible. Hence it is a sane choice as a default encoding for people who are scared of having to think about encodings. Because UTF-8 is not the only encoding out there, Unicode-compatible programs accept Unicode text in many other encodings, including those that cannot represent the full range of Unicode and are only a good choice for some people but not others.

tl;dr: non-UTF-8 text can (and should) still be read as unicode codepoints. Ignoring the >=40% of texts out there or saying that they're "not Unicode" doesn't help anybody.
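A small sketch of that point (example encoding and string are mine, not the author's): decoding bytes in any legacy encoding still yields ordinary Unicode code points, the same ones UTF-8 would give you.

```python
# Text in a legacy encoding still decodes to Unicode code points.
data = b"na\xefve"                 # "naïve" encoded as Latin-1, not UTF-8
text = data.decode("latin-1")      # decoding yields Unicode code points
print(text, hex(ord(text[2])))     # naïve 0xef -- U+00EF, same code point
                                   # no matter which encoding carried it
```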



And to a programmer, what matters is whether the encoding is present and correct.

From a security perspective, if you don't include the right encoding metadata, an attacker can include an XSS attack in UTF-7. Your server reads normal (but gibberish) ASCII, escapes it properly, then schleps it to the client, who thinks it's UTF-7 (because of all the UTF-7 characters), and suddenly they are running malicious JavaScript that you didn't escape.

If you're scraping sites, there's one thing more annoying than trying to guess the encoding, and that's dealing with multiple encodings in the same document.

Putting UTF-8 characters in a document doesn't mean you are using "Unicode". It means you are using some undeclared encoding, which is not a step forwards.


Re: UTF-7, is anyone in the world actually using that? I've only ever read about it in articles about security problems.


Certainly not on the web, it’s disallowed in the HTML5 spec:

User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings.

and major browsers removed support, e.g.:

https://bugzilla.mozilla.org/show_bug.cgi?id=414064


Not sure if this "counts," but it is still used to encode IMAP folder names. (Technically, the IMAP version is slightly different from standard UTF-7.)
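The IMAP variant (RFC 3501's "modified UTF-7") differs from standard UTF-7 in that '&' is the shift character instead of '+', ',' replaces '/' in the base64 alphabet, and "&-" encodes a literal '&'. A minimal decoder sketch (the function name and example folder name are mine):

```python
import base64

def imap_utf7_decode(name: str) -> str:
    """Decode an IMAP folder name (assumes well-formed modified UTF-7)."""
    out = []
    i = 0
    while i < len(name):
        if name[i] != "&":
            out.append(name[i])               # direct (unshifted) character
            i += 1
            continue
        end = name.index("-", i)              # a shifted run always ends at '-'
        b64 = name[i + 1:end]
        if not b64:                           # "&-" encodes a literal '&'
            out.append("&")
        else:                                 # base64 (',' for '/') of UTF-16BE
            pad = "=" * (-len(b64) % 4)
            raw = base64.b64decode(b64.replace(",", "/") + pad)
            out.append(raw.decode("utf-16-be"))
        i = end + 1
    return "".join(out)

print(imap_utf7_decode("Entw&APw-rfe"))       # Entwürfe (a German "Drafts" folder)
```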


I'm confused by that as well. I never really understood the motivation behind it - though I guess it's obsolescent, so hopefully it'll soon be just another one of those amusing anachronisms we see occasionally in computer science.


As I understand it, the motivation was to be able to transmit Unicode text over a channel that's only 7-bit safe (e.g. mail protocols) without having to do something silly like Base64-encoding the whole thing.
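A quick comparison of the two approaches (the sample text is mine): UTF-7 leaves the ASCII portion readable and only base64-encodes the exceptional characters, while base64-encoding the whole thing obscures everything and inflates it by roughly a third.

```python
import base64

text = "Mostly ASCII, with one umlaut: \u00fc"
as_utf7 = text.encode("utf-7")                  # only the umlaut is shifted
as_b64 = base64.b64encode(text.encode("utf-8")) # the whole string is opaque
print(as_utf7)   # b'Mostly ASCII, with one umlaut: +APw-'
print(as_b64)    # unreadable, and ~33% larger than the UTF-8 bytes
```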


This post makes no sense to me.

Are you saying that we should still use character encodings rather than UTF encodings, or are you saying that we shouldn't assume that raw text is ASCII, or are you saying something else?

Unicode is, essentially, nothing unless it is encoded. When you encode it you must decide whether your code units are one byte, two bytes or four bytes.

Only UTF-8 and UTF-32 are really big enough to hold the world's characters. Everything else is a fudge.

ASCII was never Anglo-Saxon: it was always American. COBOL is a red herring here, too. ASCII was all about teleprinters and was a clever use of 7 bits for its time. Bell Labs helped create ASCII, and Bell Labs invented UTF-8.

UTF-8 is much better than reasonable. It is a compact way to represent Unicode while preventing the Western world from having to re-encode every text document. That's a lot of good news for the internet.
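Both properties are easy to check (examples mine): pure-ASCII text is byte-for-byte identical in UTF-8, and everything else takes two to four bytes per code point.

```python
# ASCII-compatibility: pure-ASCII text has identical bytes in both encodings.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# Variable width: one to four bytes per code point.
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):   # A, é, €, 𝄞
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```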

Are you saying that we shouldn't ignore other charsets, as they are still valid Unicode? If so, I agree up to a point, the point being that there is no longer any need for any Unicode encoding other than UTF-8. If you need to access your local characters as a byte array, choose your internal encoding and translate, do your magic, and then spit out UTF-8 again; then we can all simply read the same documents without the over-complexity.


It makes sense to me. His point is that calling UTF-8 "Unicode" is wrong; that just confuses people. Windows programmers usually also call UTF-16 "Unicode", which is also wrong.

Unicode is not a text encoding, it's a standard that assigns a number to every character. Saying that a particular text is "Unicode" doesn't give any info on how to decode it; we should just say that a given text is UTF-7, UTF-8, UTF-16, UTF-32, etc.
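That distinction is easy to demonstrate (the example string is mine): the same code points serialize to different bytes under each UTF encoding, so "this text is Unicode" alone doesn't tell you which byte sequence to expect.

```python
s = "h\u00e9llo"   # "héllo": five code points, three different byte layouts
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {s.encode(enc)}")
```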


> Only UTF-8 and UTF-32 are really big enough to hold the world's characters

So is GB18030.

Edit: and UTF-16, of course (just don't confuse it with UCS-2)
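Both claims check out in a couple of lines (examples mine): GB18030 round-trips any code point, and UTF-16 escapes the UCS-2 ceiling of U+FFFF by using surrogate pairs for astral characters.

```python
# GB18030 covers all of Unicode: every code point round-trips.
s = "h\u00e9llo \u4f60\u597d \U0001d11e"   # Latin, CJK, and an astral character
assert s.encode("gb18030").decode("gb18030") == s

# UCS-2 stops at U+FFFF; UTF-16 encodes U+1D11E as the surrogate pair D834 DD1E.
print("\U0001d11e".encode("utf-16-be"))
```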


I'm guilty of using UTF-16 when I really mean 16-bit Unicode arrays. I see that now.

But it does raise the question: why would you use UTF-16? Yes, if you have more than 512 characters in your common script then OK, I can see it might make sense, but not much. UTF-8 will still average out in a reasonable way.


Because it's the standard in the Windows API.



