Unicode in Python 3

wbond · on May 12, 2014

Having written a bunch of Python 2 and porting it to 3 where I deal with unknown encodings (FTP servers), I can't help but disagree with Armin on most of his Python 3 posts.

The crux of his argument with this article is "unix is bytes, you are making me deal with pain to treat it like Unicode." Python 2 just allowed to take crap in and spit crap out. Python 3 requires you to do something more complicated when crap comes in. In my situation, I am regularly putting data into a database (PostgreSQL with UTF-8 encoding) or working with Sublime Text (on all three platforms). You try to pass crap along to those and they explode. You HAVE to deal with crappy input.

In my experience, Python 2 explodes at run time when you get weird crappily-encoded data. And only your end users see it, and it is a huge pain to reproduce and handle. Python 3 forces you to write code that can handle the decoding at the get go. By porting my Python 2 to 3, I uncovered a bunch of places where I was just passing the buck on encoding issues. Python 3 forced me to address the issues.

I'm sure there are bugs and annoyances along the way with Python 3. Oh well. Dealing with text input in any language is a pain. Having worked with Python, C, Ruby and PHP and dealing with properly handling "input" for things like FTP, IMAP, SMTP, HTTP, etc, yeah, it sucks. Transliterating, converting between encodings, wide chars, Windows APIs. Fun stuff. It isn't really Python 3 that is the problem, it is undefined input.

Unfortunately, it seems Armin happens to play in areas where people play fast and loose (or are completely oblivious to encodings). There is probably more pain generally there than dealing with transporting data from native UI widgets to databases. Sorry dude.

Anyway, I never write Python 2 anymore because I hate having this randomly explode for end-users and having to try and trace down the path of text through thousands of lines of code. Python 3 makes it easy for me because I can't just pass bytes along as if they were Unicode, I have to deal with crappy input and ask the user what to do.

Python 2 is a dead end with all sorts of issues. The SSL support in Python 2 is a joke compared to 3. You can't re-use SSL contexts without installing the cryptography package, which requires, cffi, pycparsers and bunch of other crap. Python 2 SSL verification didn't exist unless you roll your own, or use Requests. Except Requests didn't even support HTTPS proxies until less than a year ago.

Good riddance Python 2.

the_mitsuhiko · on May 12, 2014

> Python 3 requires you to do something more complicated when crap comes in.

Or in most cases: Python 3 falls flat on the floor with all kinds of errors because you did not handle unicode with one of the many ways you need to handle it.

On Python 2 you decoded and encoded. On Python 3 you have so many different mental models you constantly need to juggle with. (Is it unicode, is it latin1 transfer encoded unicode, does it contain surrogates) and then for each of them you need to start thinking where you are writing it to. Is it a bytes based stream? then surrogate errors can be escaped and might result in encoding garbage same as in python 2. If it a text stream? Then that no longer works then you can either crash or write different garbage. If it's latin1 transfer encoded then most people don't even know that they have garbage. I filed lots of bugs against that in WSGI libs.

If you write error free Python 3 unicode code, then teach me. (Or show me your repo and I show you all the bugs you now have)

zzzeek · on May 12, 2014

> (Or show me your repo and I show you all the bugs you now have)

this would be great. Show me! I'd love to know:

https://bitbucket.org/zzzeek/sqlalchemy/

https://bitbucket.org/zzzeek/mako/

https://bitbucket.org/zzzeek/alembic/

I'm guessing you'd go for Mako first since it has the most unicode intense stuff going on (and it uses lots of your code).

the_mitsuhiko · on May 12, 2014

As an example mako cli. You can call this an error or not, but with C locale your cmdline will die with UnicodeErrors when you open a non existing file with unicode filename on Python 3 but not so on Python 2 where it will do the correct thing. It will also die with unicode errors under the same situation when your template renders any unicode characters. Again, something that probably works fine on python 2 and correctly.

Or if you would put unicode characters into your README.rst you could no longer safely install mako. Again, Python 3 only.

These are just two things I found on github.

Another easy one: alembic README's now no longer can safely contain unicode. They would break on Python 3, but work just fine on Python 2 because of the code in list_templates.

zzzeek · on May 12, 2014

the cmdline template runner at the moment isn't doing unicode in Py2K either, crashes there too.

wbond · on May 12, 2014

I would not be surprised if you can construct contrived examples of how Python 3 can be broken. In my experience, writing real life code, I ship more stable software writing in Python 3 than Python 2.

I mostly work with subprocesses or directly reading data from socket connections, and I run all of my bytes through strict mode. If something doesn't decode properly, an error is returned. Currently I am working on an interactive way (inside of Sublime Text) to present to the user a way to see text in different encodings so they can help debug the issue on their own.

So, yes, you need to write helper functions and have an interface to deal with properly handling encodings. This has been my experience in every language I've ever worked in. I can't imagine there is a way around it. Is this a reason Python 3 sucks compared to 2? Not in my experience. I had far more issues in Python 2 with encodings and not being sure what other libraries and packages had done in regards to handling unicode data. Hmm, so ftplib accepts unicode for filenames. Does it encode it? What encoding does it use? Oh, look at that, it has just been coercing to ascii because it can.

So yeah, writing a simple little toy command line app needs more boiler-plate to deal with unicode. Any real app is going to need that and a ton more. And you are going to have to decide how to error with encodings, and how to let users identify encodings. And you are going to need to write a global exception handler for Python to capture unexpected exceptions and log them to a file so users can send crash reports. Yay, sys.excepthook!

But anyway, I think it all comes back to the fact that I know what I am dealing with far more quickly with Python 3 than with 2. Again, maybe because I don't write apps that deal with local file paths (expect abstracted through a subprocess).

Unfortunately, most of the code where I deal with crappy encodings from FTP servers and SVN is closed source. The open source stuff is at https://github.com/wbond.

muyuu · on May 13, 2014

> In my experience, writing real life code, I ship more stable software writing in Python 3 than Python 2.

Real life code is not the same for everybody.

twic · on May 12, 2014

There was a related discussion on the Mercurial mailing list a while back. Not about Python 2 vs 3, but about filename encoding.

Mercurial follows a policy of treating filenames as byte strings. Matt Mackall is very clear about this. Because unix treats filenames as byte strings, this makes Mercurial interoperate with other programs on a unix machine pretty well: you can manage files of any encoding, you can embed filenames in file contents (eg in build scripts) and be confident they will always be byte-for-byte identical with the names managed by Mercurial, etc.

However, it also means Mercurial falls flat on its face when it's asked to share files between machines using different encodings. Names which work fine on one machine will, to human eyes, be garbled nonsense on the other.

This is a problem which does actually happen; there is a slow trickle of bug reports about it. And because of the commitment to unix-style filenames, it will probably never be fixed. List members did try and come up with some ideas to fix it which preserved the unix semantics normal cases, but they weren't popular.

And before anyone gets lippy, i assume Git has the same problem.

Ultimately, i would say this comes down to a conflict between two fundamentally different kinds of users of strings: machines and people. Machines are best served by strings of bytes. People are best served by strings of characters. Usually. And sadly, unix's lack of a known filesystem encoding is too well-established for there to be much chance of building a bridge.

andreasvc · on May 12, 2014

What do you mean by "share files between machines"? Do you mean over a protocol? In that case the protocol over the wire should be well-defined and would avoid problems. If you mean by sharing files over an USB-stick then it's not so much an application problem as an OS issue.

I don't think the argument about machines wanting bytes is true. Machines will accept anything as long as it is well-defined. I'm really curious why there isn't yet some Linux or Posix standard that mandates utf-8. What's the problem with just decreeing that version +1 of the standard now expects utf-8?

twic · on May 12, 2014

What do you mean by "share files between machines"?

Commit files into a repository on one machine. Move it to another on a USB stick, by FTP, with the DVCS's transport protocol, whatever. All of those result in repositories containing byte-for-byte identical commits.

In that case the protocol over the wire should be well-defined and would avoid problems.

Oh, all of these are well-defined. They're defined to produce filenames which comprise the same sequence of bytes everywhere. That's the problem!

If you mean by sharing files over an USB-stick then it's not so much an application problem as an OS issue.

Bear in mind that the problem is not what the OS does with the names of files being moved around, it's with what the DVCS does with the names that are embedded in the content of its data files.

andreasvc · on May 12, 2014

Ah, thanks for clarifying.

marcosdumay · on May 13, 2014

Well, honestly, the other way (interpreting filenames) would create worse interoperability problems.

PeterisP · on May 12, 2014

It seems that git doesn't have this problem - just tried adding an encoding-sensitive nonASCII filename, and it worked correctly when pulled between different operating systems (macos, win7, ubuntu).

twic · on May 12, 2014

What encodings were you using on those machines? The operating system is irrelevant; it's the encoding that matters.

overgard · on May 12, 2014

I had to deal with this a lot at a job I used to have (not python specifically, but just with unicode issues), and there's really just not a right answer to how to do any of this. Any solution you pick is going to suck for someone.

One thing he's leaving out of the Python 2 being better aspect: Ok, for cat you can treat everything as one long byte array. But what if, say, I need to count how many characters are in that string? Or what if I need to write a "reverse cat", which reverses the string? Python 2's model is entirely broken there.

Armin suggests that printing broken characters is better than the application exploding and I agree.. sometimes. On the other hand, try explaining to a customer why the junk text they copy pasted from microsoft word into an html form has question marks in it when it shows on your site.

The problem with the whole "treat everything as bytes" thing is that you'll never have a system that quite works. You'll just have a system that mostly works, and mostly for languages closer to english. Going the rigorous route is the hard way, but it will end up with systems that actually work right.

rdtsc · on May 12, 2014

> There is a perfectly other language available called Python 2, it has the larger user base and that user base is barely at all migrating over. At the moment it's just very frustrating.

I come from a different perspective, I looked at the benefits of Python 3 and looked at my existing code base and how it would be better if was written in Python 3 and apart from bragging rights, and having a few built-in modules (that now I get externally) it wouldn't actually be better.

To put it plainly, Python 3, for me, doesn't offer anything at the moment. There is no carrot at the end. I have not seen any problems with Unicode yet. Not saying they might not be lurking there, I just haven't seen them. And, most important, Python 2 doesn't have any stick beating me on the head with, to justify migrating away from it. It is just a really nice language, fast, easy to work with, plenty of libraries.

From from _my_ perspective Python 3 came at the wrong time and offered the wrong thing. I think it should have happened a lot earlier, I think to justify incompatibilities it should have offered a lot more, for example:

* Increased speed (a JIT of some sort)

* Some new built-in concurrency primitives or technologies (something greenlet or message passing based).

* Maybe a built-in web framework (flask) or something like requests or ipython.

It is even hard to come with a list, just because Python 2 with its library ecosystem is already pretty good.

kunstmord · on May 12, 2014

Most libraries support Python 3 (http://py3readiness.org/). 3.4 added asyncio, 3.3 added the 'yield from'. Python 3 has a saner print statement, type annotations. I guess that it isn't that much, but I moved mostly due to the fact that a) the libraries I need support python3 anyway b) the stuff I write might need 'yield from' c) what if Python 3.x adds some awesome feature that you'll need, and by that time, you're stuck with a ton of code you need to convert?

tormeh · on May 12, 2014

Well, the payoff is probably not that great, but how much effort is really required to move to 3? Rewriting print statements and changing a couple import statements? Anything else? There's no carrot and no stick, but you're only being asked to stand up for a second so someone can switch your chair into something more comfortable. You're not exactly rewriting it in Perl.

rdtsc · on May 12, 2014

> but how much effort is really required to move to 3? Rewriting print statements and changing a couple import statements?

Most important -- risk. Risk that stuff will break. One of the biggest ones is change to the .keys(), .values() to behave like iterkeys.

Also unicode vs byte strings.

Also time. And time = $ in most places.

So far need the benefits are just not there. It is something like this:

benefit(switch) = code_improvements - time - risk + possible_future_benefits

(time has an opportunity cost folded into it, if I am porting to 3 I am not working on other stuff).

So far benefit is either negative or just too small for me.

zzzeek · on May 12, 2014

> Most important -- risk. Risk that stuff will break.

that's what test coverage is for. if you don't have coverage, then your code is already broken.

rdtsc · on May 12, 2014

Well customers pay for it, use it and like it. That my book qualifies it as not broken. They can chose other software but they pick this one.

Also, just because unit tests cover the code and pass doesn't mean product is not broken. Two working units of code adding together in a system don't guarantee that system will do what it is supposed to do. So yes there is risk.

The bigger problem is that there are not tangible benefits of Python 3. That is its tragedy the way I see it.

And time-wise, it is pretty sad, it might take me less than a few days to work through it, but it is still not worth it.

ak217 · on May 12, 2014

Is sys.getfilesystemencoding() not a good way to get at filename encoding?

I think on the face of it I do like the Go approach of "everything is a byte string in utf-8" a lot, but I haven't really worked with it so there's probably some horrible pain there somewhere, too. In the meantime Python 3 is a hell of a lot better than Python 2 to me because it doesn't force unicode coercion with the insane ascii default down my throat (by the time most new Python 2 coders realize what's going on, their app already requires serious i18n rework). Also, I don't really know why making sure stuff works when locale is set to C is important - I would simply treat such a situation as broken.

In writing python 2/3 cross-compatible code, I've done the following things when on Python 2 to stay sane:

- Decode sys.argv asap, using sys.stdin.encoding

- Wrap sys.stdin/out/err in text codecs from the io module (https://github.com/kislyuk/eight/blob/master/eight/__init__....). This approximates Python 3 stdio streams, but has slightly different buffering semantics compared to Python 2 and messes around with raw_input, but it works well. Also, my wrappers allow passing bytes on Python 2, since a lot of things will try to do so.

taejo · on May 12, 2014

> I think on the face of it I do like the Go approach of "everything is a byte string in utf-8" a lot, but I haven't really worked with it so there's probably some horrible pain there somewhere, too.

The problem with "everything is a byte string in utf-8" is simply that it's false. Some byte strings are in UTF-16, some are in Big5, and some aren't text at all. I assume that the intention is that all non-utf-8 input gets converted as soon as possible and all non-utf-8 output as late as possible; this is essentially the Python 3 idea, except with a type system that tells you when you messed it up. I've seen Python 2 projects that used this approach, but I prefer to have an exception thrown as soon as I make a mistake (instead of choking on a Chinese HTML file three months later, or throwing up mojibake)

pcwalton · on May 12, 2014

> I think on the face of it I do like the Go approach of "everything is a byte string in utf-8" a lot

That isn't actually the case in Go: strings do not enforce UTF-8. However, the libraries that deal with text, as well as the compiler, are opinionated about treating strings as UTF-8.

the_mitsuhiko · on May 12, 2014

getfilesystemencoding() is unreliable on Linux has linux has no file system encoding. It just returns the first match of LC_ALL, LC_CTYPE, LANG (not sure in which order).

pilif · on May 13, 2014

I wouldn't call it unreliable then because whatever LC_CTYPE is set to is what the user expects their file names to be interpreted as.

If the contents of LC_CTYPE is wrong for a particular file name, at least you get consistency between your python program and everything else on the system.

inklesspen · on May 12, 2014

If you want to work with bytes on stdin and stdout, Python 3 documents how to do that, at the same place it documents the stdin and stdout streams.

https://docs.python.org/3/library/sys.html#sys.stdin

All you have to do is use sys.stdin.buffer and sys.stdout.buffer; the caveat is that if sys.stdin has been replaced with a StringIO instance, this won't work. But in Armin's simple cat example, we can trivially make sure that won't happen.

I'd be a lot more willing to listen to this argument if it didn't overlook basic stuff like this.

DasIch · on May 12, 2014

Yes, the documentation mentions that you can use buffer and follows that by a sentence explaining that you can do this unless you can't:

> Note that the streams may be replaced with objects (like io.StringIO) that do not support the buffer attribute or the detach() method and can raise AttributeError or io.UnsupportedOperation.

So no this is neither basic nor easy to do correctly in general. That's only the case, if you are writing an application and use well-behaved libraries that handle the edge cases you introduce.

CatMtKing · on May 12, 2014

I guess it's a little odd that Python 3 treats stdin and stdout by default as unicode text streams. And sys.argv is a list of unicode strings, too, instead of bytes.

pekk · on May 12, 2014

I like Python 3's unicode handling but I agree that this seems strange. It is because people expect to see "characters" from these interfaces after treating them as ASCII-only for so long. If Python 3 had insisted on real purity with bytes objects I think it would have died a long time ago. Which is sad.

andreasvc · on May 12, 2014

> I'd be a lot more willing to listen to this argument if it didn't overlook basic stuff like this.

Just because there's a way around this particular issue doesn't mean that the attitude of Unicode by default of Python 3 isn't problematic. There's also sys.argv, os.listdir, and other filename stuff which Python 3 attempts to decode.

mangecoeur · on May 12, 2014

I get that Armin runs into pain points with Py3, but on the other hand I get annoyed with the heavily English centric criticims - its easy to think py2 was better when you're only ever dealing with ASCII text anyway.

Fact is, most of the world doesn't speak english and needs accents, symbols, or completely different alphabets or characters to represent their language. If POSIX has a problem with that then yes, it is wrong.

Even simple things like french or german accents can make the Py2 csv module explode, while Py3 works like a dream. And anyone who thinks they can just replace accented characters with ASCII equivalents needs to take some language lessons - the result is as borked and nonsensical as if, in some parallel univese, I had to replace every "e" with an "a" in order to load simple english text.

Thrymr · on May 12, 2014

Armin is Austrian. Whatever else you think of his critique, it's probably not English-centric and he's had to deal with accents in his native language.

the_mitsuhiko · on May 12, 2014

My libraries are all supporting unicode on Python 2. And in fact, they do it better than on Python 3. File any unicode bugs you might encounter on Python 2 against me please.

mangecoeur · on May 13, 2014

This is true, and in fairness I've never had any problems with unicode using any of your libraries, probably because you take a lot of care in explicitly dealing with encoding.

But that's not always the case with other libraries like the csv module. The core unicode support in py3 means that a lot of librares which are not written with explicit unicode in mind Just Work with it in py3, and its a huge time saver.

andreasvc · on May 12, 2014

This is not what the blog post is saying. It is saying that Python 3's attitude of forcing Unicode is making life difficult, whereas in Python 2 it is easier to decode to Unicode where needed, and be able to accept non-Unicode data in other cases. That Unicode needs to be supported was never under discussion.

abus · on May 13, 2014

I've used the Py2 csv module with csv files containing accented characters and had no issues. Could you post an example?

This works fine, as does the corresponding writer:

    reader = csv.reader(open(FILENAME, 'rb'))
    for row in reader:
        print row

If you want a unicode string it's as simple as:

    value = row[0].decode('utf8')

Then before writing:

    row[0] = value.encode('utf8')

lmm · on May 12, 2014

If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode. Don't handle the decode errors - if someone sends some data to your stdin that's not in the correct encoding, too bad.

Yes, python3 makes it hard to write programs that operate on strings as bytes. This is a good thing, because the second you start to do anything more complicated than read in a stream of bytes and dump it straight back out to the shell (the trivial example used here), your code will break. Unix really is wrong here, and the example requirement would seem absurd to anyone not already indoctrinated into the unix approach: you want a program that will join binary files onto each other, but also join strings onto each other, and if one of those strings is in one encoding and one is in another then you want to print a corrupt string, and if one of them is in an encoding that's different from your terminal's then you want to display garbage? Huh? Is that really the program you want to write?

chimeracoder · on May 12, 2014

> If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode.

That's actually not really Go's approach. In Go, strings do not have encodings attached to them.

Source files are defined to be UTF-8 (by the compiler), so string literals are always unicode. That's not quite the same thing as saying that the "string" type in Go is always Unicode (it's not). And when you're dealing with a byte slice ([]byte), you cannot make any assumptions about the encoding.

It took a bit to wrap my head around this when I first read about it[0], but now that I think about it, I think it's the right way to go[1].

[0] http://blog.golang.org/strings

[1] And for what it's worth, Go and UTF-8 were designed by (some of) the same people, so one would hope they'd get it right!

lmm · on May 12, 2014

You're right, I was lazily responding to the article on its own terms rather than engaging properly with go's string handling.

I think the approach of keeping strings encoded has promise but it would need to be supported by a stronger type system than go's. When you're carrying around two encoded byte arrays it's really important that you know what their encodings are and don't try and e.g. concatenate them. Ruby can do this right because it can give you a runtime type error, but that's not acceptable in a compiled language. So you need to distinguish between byte arrays with statically known encodings, byte arrays with dynamically known encodings and byte arrays with unknown encodings. And you should really e.g. disallow slicing a byte array that represents a string, so that you don't cut a character in half.

I know that one of the UTF-8 guys worked on go and I'm sure go will work well when everything's in UTF-8. But all languages work well when everything's in UTF-8; if anything this makes me more worried that go's authors won't give proper support to those who have to work with strings in non-UTF8 encodings. (By contrast one of the reasons Ruby's support is good is that the author is Japanese and therefore pretty much has to work with strings in multiple non-unicode encodings, because of han unification)

burntsushi · on May 12, 2014

> So you need to distinguish between byte arrays with statically known encodings, byte arrays with dynamically known encodings and byte arrays with unknown encodings.

I don't buy this. You've just introduced some rather large complexity into the types of byte strings just for the sake of handling non-UTF8 cases, which seem to be getting less common than they used to be.

Rust is taking a similar approach to Go (except that their string type must be valid UTF8). You can see a recent debate here: https://mail.mozilla.org/pipermail/rust-dev/2014-May/009725....

lmm · on May 12, 2014

If you don't care about handling non-UTF8 cases, you can use pretty much any language (python3 included - the issues the OP is complaining about are when you have filenames in a different encoding from your terminal or the like), write the obvious thing, and it will work fine.

For many use cases that's good enough. But the cases where languages are different, the cases where it gets interesting, are when that isn't enough. (And it won't be enough if you want to sell your software in Japan, for example).

pcwalton · on May 12, 2014

There's actually a bit more to what Rust does: there is a well-known community library called rust-encoding that adds new string types that support various encodings. You can use this library if you need to support other encodings. The standard library supports only UTF-8, but it's simple enough to abstract over strings in multiple encodings if you need to (thanks to generics).

I like this approach: it allows simplicity in the common case, for software that only needs to work in UTF-8, while allowing support for arbitrary other encodings. "Easy things should be easy, and hard things should be possible."

pcwalton · on May 12, 2014

> It took a bit to wrap my head around this when I first read about it[0], but now that I think about it, I think it's the right way to go[1].

I don't actually think it's the right way to go: I'd prefer if strings enforced that they were UTF-8. That saves a lot of error checking by clients that consume it and want to be defensive. It's most efficient to have the type system do the work.

andreasvc · on May 12, 2014

So how do you then deal with problem in the original blog post: a filename that is a non-UTF8 byte string? Surrogates? A separate byte string type?

pcwalton · on May 12, 2014

Use a separate byte string type. The problem is that paths are not UTF-8. That's a fixable problem: just let the type system know about it.

the_mitsuhiko · on May 12, 2014

> If you're happy with Go's "everything is a unicode string" approach then you should be happy to just treat everything as unicode. Don't handle the decode errors - if someone sends some data to your stdin that's not in the correct encoding, too bad.

Go's approach is transparently passing data through. In Python 3 your process crashes if you do that.

lmm · on May 12, 2014

In my book crashing is better than silently corrupting your data, which is what you'd get if you concatenated two files with different encodings the Go way.

the_mitsuhiko · on May 12, 2014

The idea that data corruption happens is wrong though. There is no data corruption, at least not less than in Python 2 now that we have surrogate escape printing support in some cases.

The only way to avoid data corruption is to be explicit about encodings which in 2.x is easy and in 3.x is almost as easy except if you want to work with stdout/stdin.

suzuki · on May 13, 2014

I totally agree. The transparency of 2.x & Go wins. Python 3.x is a disaster for me, a Unix user in Japan.

ninkendo · on May 12, 2014

Could we all agree to not use the word "unicode" when talking about encoding (ie. how code points are serialized to bytes)? Unicode (ie. the standard set forth by the unicode consortium) has nothing to say about encoding.

I think what you mean is "Go's 'everything is a UTF-8 string' approach", but I'm not familiar enough with Go's internal encoding to know.

For instance, you mention "Don't handle the decode errors", but I can only assume by that you mean UTF-8's decode errors, since UTF-8 has the possibility of having encoding errors, where things like UTF-32 do not. They're both Unicode encodings, so it makes no sense to say it's "Unicode"'s decode errors.

I think the author of this article falls into the same trap all over the place. He uses the word "Unicode" to refer to an encoding all over the place. Until he is able to distinguish the difference between the Unicode standard and all its various encoding methods, there's not much point in reading his article.

(And no, Microsoft's braindead decision to use the word "Unicode" to mean "UCS-2", and later ret-conning it to mean "UTF-16", doesn't count. Don't perpetuate the stupid.)

captaincrowbar · on May 12, 2014

"Unicode (ie. the standard set forth by the unicode consortium) has nothing to say about encoding."

Yes it does. The UTF-8/16/32 encodings are defined in the Unicode standard (chapter 3.9, page 89), they're not some kind of aftermarket add-on.

"UTF-8 has the possibility of having encoding errors, where things like UTF-32 do not."

Also not true. The set of valid Unicode code points is the same regardless of the encoding (0-0xd7ff and 0xe000-0x10ffff); anything outside those two ranges is just as invalid in UTF-32 as its UTF-8 encoded equivalent would be. A stream containing nothing but 0xff bytes, for example, would be illegal regardless of which UTF you tried to interpret it as.

ninkendo · on May 13, 2014

> Yes it does. The UTF-8/16/32 encodings are defined in the Unicode standard (chapter 3.9, page 89), they're not some kind of aftermarket add-on.

Fair enough, but would you not agree that the word is overloaded? The fact that a few encodings are listed in the Unicode standard doesn't make it ok to wave your hand and call it all "Unicode" in essays like these where it very much matters what actual encoding you're talking about. If you mean UTF-8, stay UTF-8, etc.

cool-RR · on May 12, 2014

Worth it if only for `copyfileobj`. As a seasoned Python expert, I was not familiar with that function. From the docs:

shutil.copyfileobj(fsrc, fdst[, length]) Copy the contents of the file-like object fsrc to the file-like object fdst. The integer length, if given, is the buffer size. In particular, a negative length value means to copy the data without looping over the source data in chunks; by default the data is read in chunks to avoid uncontrolled memory consumption. Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied.

andreasvc · on May 12, 2014

I think the main problem here is an impedance mismatch caused by forcing things to be Unicode. While the Python developers are technically correct (the best kind they say..) in claiming that LANG=C means ASCII, that's not how everything else in UNIX works until now, most applications don't crash because of encoding errors. And filenames are byte strings, so forcing Unicode on them is a bad idea.

It would be great if everyone fixed their locale settings and all their filename encodings but in the meantime this will cause even more friction for Python 3 adoption.

andrewstuart · on May 13, 2014

It's a great concern that some of Python's most respected developers such as mitsuhiko and Zed Shaw are not on board with the current future direction of Python. It would be a better world for all if somehow Python 4 could be something that everyone is happy with - I want the mitsuhikos and Zed Shaws of the world to be writing code that I can run as a Python 3 user, written in a language that these top level developers feel enthused about.

Is there no way forward that everyone agrees on? Has anyone ever proposed a solution?

shadowmint · on May 12, 2014

> That I work with "boundary code" so obviously that's harder on Python 3 now (duh)

mhm. I tell people now and then that python 3 (and the python 3 developers) are hostile to people embedding it and using it for low level tasks specifically because of this unicode stuff, and they tend to tell me I should just suck it up.

I suppose I'm morbidly glad not the only one feeling the pain, but really, it honestly feels like python 3 line is just not making any effort towards making this stuff easier and simpler. :/

ygra · on May 12, 2014

Unicode, dealing with text, i18n are never easy and simple. That being said, there are lots of things that work on both Windows and Unix and use Unicode internally, even for file names and paths (e.g. Qt and the already-mentioned Java). Qt is even used by a popular-ish desktop environment. If that approach were that unsuitable and utterly incompatible with the Unix approach on encodings I wonder why it apparently does work.

tormeh · on May 12, 2014

Isn't utf-8 just a string of bytes where each byte represents a sign? Why is it hard?

sp332 · on May 12, 2014

No. http://en.wikipedia.org/wiki/UTF-8#Description How could you encode millions of characters into just 256 values?

tormeh · on May 13, 2014

Honestly, I thought that was what utf-16 was for. I thought the number was the bitlength of a sign.

ygra · on May 13, 2014

The number says how wide a code unit is. However, it doesn't say anything about how many code units are required to encode a single code point. UTF-8 needs 1–4, UTF-16 needs 1–2 and UTF-32 needs 1. They all are able to represent all encoded characters of Unicode, it's just that the individual bytes are different.

kalleboo · on May 13, 2014

No, single characters can be represented by a string of multiple bytes. And you can construct an invalid string of bytes that would cause a Unicode error. The problem comes from what to you in these applications when you come across a Unicode error. Such as, if the user tries to open a file where the filename contains a string of bytes that would be invalid Unicode. Refuse to open the file?

tormeh · on May 13, 2014

Isn't the standard approach to just show a box or similar gibberish? I've seen enough of those.

ygra · on May 13, 2014

Boxes are just the font not having that glyph – that doesn't destroy data. Question marks, U+FFFD or random gibberish are arguably at least as disastrous as refusing to open a file (maybe more because you don't actually notice something is wrong).

_ZeD_ · on May 12, 2014

because is not.

sp332 · on May 12, 2014

Could you use the "bytes" type instead of the "string" type for that low-level stuff?

andreasvc · on May 12, 2014

That would be preferable in these situations. However, the problem is that the standard library uses Unicode for filenames and stdin/out by default, as described in the blog post, and you get errors as soon as anything cannot be decoded.

sp332 · on May 12, 2014

Aw, I knew it couldn't be that easy.

andrewstuart · on May 12, 2014

I hear and understand and agree with the issues raised, the question is what is the right way to fix this stuff? How can we get there?

How can we get the Python 2 stalwarts and the Python 3 folks to all sit in the same figurative room and create a future that everyone is happy with?

It would be nice to see the ongoing grumbling about Python 3 replaced with a tangible peace process.

Are the warring parties talking about solutions?

rectangletangle · on May 13, 2014

It'd be nice, I don't want Python to go the way of PERL.

andrewstuart · on May 13, 2014

The Python 2 "stayers" choices are:

1: stick with Python 2 forever

2: move their skills to another language

3: let go of Python 2 and move to Python 3 despite their concerns

Armin seems so disillusioned that I get the sense he'll either go for option 1 or 2, which is very concerning for people using Jinja2 and Flask and all his other stuff (most of which he has converted to Python 3 albeit unenthusiastically). He has said in one of his blog posts that although he has ported his code to Python 3, he does not use it himself at work and doesn't intend to. Having said that, his stance has softened significantly over the most recent 12 months as evidenced by the full porting of Flask to Python 3.

Zed I'm guessing initially go for option 1 and given his previous disposition to change technologies, might get sick of Python 2's deadness and go for option 2. Zed's public proclamations appear to have invested him quite heavily in not going with Python 3 so it's hard to see what would ever lead him there. Zed's "Learn Python the Hard Way" is the gateway through which new Python programmers are learning and thus all those new developers are starting out as Python 2 only people. If a way can be found to satisfy Zed that some future version of Python is "a good thing", then he will bring his students/followers with him.

But who knows.

It would be good if there was:

4: Zed and Armin and the other most vocal Python 2 advocates specify what they want to see in Python 4, somehow it gets included, everyone happy.

Zed and Armin are by no means the only first class Python developers there's tons of others, but they write really interesting stuff, they are extremely outspoken in their criticism of Python 3, and they have a large and loyal following who respect their opinions, so it would be nice to see them happily participating in Python's future.

What can be done to bring the Python 2 stayers to the most recent releases of Python? Who knows. It's not healthy for Python 3 to have such vocal critics so something should be done.

Even as the 2 versus 3 war continues, Python 3 seems to be gaining real momentum at least as measured by the number of libraries that are now available for Python 3 - it seems that even though some people are sticking with Python 2 there's a groundswell of support for Python 3. After all, for the ordinary programmer trying to get a website built, what's the point in learning the six year old version, all that leads to is the question of "ok, at what time will I learn Python 3?". I'm a beginner and a very ordinary programmer and I find Python 3 much easier to wrap my head around than Python 2 - I dread the times I have to dig into Python 2.

Python 3 will be fine. It has momentum, it will grow, eventually Python 2 will be so far in the past that there will be no way to look at it except in the same way we see OS/2, Amiga and DOS - long gone. It would be much better though if everyone was happy.

Python 4 (maybe it should be 3.6) should be the version that ends the civil war and gives the stayers what they want somehow.

e12e · on May 12, 2014

I don't know... I get an error from the first script with python3:

    $ ls
    test  test3.py  test.py  tøst  日本語
    $ python2.7 test.py *
    hello hellø こにちは tøst 日本語
    import sys
      # (…)
    hello hellø こにちは tøst 日本語
    hello hellø こにちは tøst 日本語
   
    $ python3 test.py *
    Traceback (most recent call last):
      File "test.py", line 13, in <module>
        shutil.copyfileobj(f, sys.stdout)
      File "/usr/lib/python3.2/shutil.py", line 68, in copyfileobj
        fdst.write(buf)
    TypeError: must be str, not bytes
    
    #But I can make it work with:
    $ diff test.py test3.py 
    8c8
    <             f = open(filename, 'rb')
    ---
    >             f = open(filename, 'r')
    
    $ python3 test3.py *
    # same as above

Now, these two scripts are no longer the same, the python3 script outputs text, the python2 script outputs bytes:

    $ python3 test3.py /bin/ls
    Traceback (most recent call last):
      File "test3.py", line 13, in <module>
        shutil.copyfileobj(f, sys.stdout)
      File "/usr/lib/python3.2/shutil.py", line 65, in copyfileobj
        buf = fsrc.read(length)
      File "/usr/lib/python3.2/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 24: invalid start byte

The other script works like cat -- and dumps all that binary crap to the terminal.

So, yeah, I guess things are different -- not entirely sure that the python3 way is broken, though? It's probably correct to say that it doesn't work well with the "old" unix way in which text was ascii and binary was just bytes -- but consider:

    $ cat /bin/ls |wc
    403    2565  114032
    e12e@stripe:~/tmp/python/unicodetest $ du -b /bin/ls
    114032  /bin/ls

Does that "wordcount" and "linecount" from wc make any sense? For that matter, consider:

    $ cat test
    hello hellø こにちは tøst 日本語
    e12e@stripe:~/tmp/python/unicodetest $ wc test
     1  5 42 test

(Here the word count does make sense, but just because it's an artificial example, it wouldn't make sense for actual Japanese).

The character count is pretty certainly wrong unless you cared about what "du -b" thinks of the number of bytes...

lifeisstillgood · on May 12, 2014

The japanese example is interesting - because wc really rather depends on the language. So does regex. And quite a lot of other things that are useful in a Latin-derived world kind of get harder in a right to left inflected written language (if there is one, some Arabic comes to mind).

I think if anything will force us to rethink the underlying assumptions of Unix, its unicode.

e12e · on May 12, 2014

Please note that:

   $ echo "wc can't count æøæ either" |wc
      1       5      29
   $ echo "wc can't count aaa either" |wc
      1       5      26

[edit: Also, note that Japanese is both left-to-right and top-to-bottom,right-to-left]

lifeisstillgood · on May 12, 2014

Is there a way to divide the Unicode world into ranges, with some clearly marked as "will work with this approach" and others marked differently.

A sort of code-pages approach - but we all work on the same Unicode foundation, just when it comes to Japanese a non-speaker like me would gracefully down-scale all the operations to "print and then suggest we hire some people to write an extension".

Its I guess linking LOCALE to a number range...

EdiX · on May 13, 2014

wc counts bytes, to make it count characters use -m in the GNU version.

lifeisstillgood · on May 13, 2014

I think the point being made is that -m does not count characters, it counts multi-bytes. Or at least tries to. So the same Unicode point in utf-8 and utf-16 (and utf-32) could be very different strings of bytes. No way to tell unless you know before hand you are dealing with utf-8 or 16. Hence BOM, but no one likes that.

Its hard. And possibly we have to abandon tools like wc when we leave the Latin world.

the_mitsuhiko · on May 12, 2014

The first script does not work on Python 3. That's the whole point of the post. You need to use the second one.

e12e · on May 12, 2014

Well, the first script works if you don't open the files as binary files (see my "patch") -- and if you don't you get a sane error when you try to open a binary file. So they're different, but I still say it's not so clear that one is broken, one is not.

I'm also not clear where all the comments on filenames comes from, but maybe they just happen to work in my case?

[edit: see my other comment on treating stdin/out as binary from python3]

DasIch · on May 12, 2014

If you don't open the files as binary, the script doesn't do what it's supposed to.

e12e · on May 13, 2014

from the manpage of cat(1) my emphasis:

"cat - concatenate files and print on the standard output"

Cat has quite a few options that clearly deal with text files: --number-nonblank, --show-ends, --number, -show-tabs, --show-nonprinting, --squeeze-blank ...

BSD cat[1] is a little more conservative, but it's not entirely clear cut that when printing to standard output (or reading from standard input) you want binary mode. And silently copying a binary stream to a text stream does sound like a bug. Even if that is the expected behaviour of standard "cat".

[1] http://www.freebsd.org/cgi/man.cgi?cat(1)

lazzlazzlazz · on May 12, 2014

Isn't the entire point of the post that the first Python3 example will not work?

e12e · on May 12, 2014

Yes, but I think the post is somewhat disingenuous in that if you want to read text-files, you can just open the files in non-binary (ie: the exact same script with a single character changed, the mode from "rb" to "r") -- and it will work for most cases where you can expect it to work.

I've not dived all the way down the (python3) rabbit hole, but it seems to me that there probably is (should be, at least) a way to say, yes, I want to read and write in binary, I don't care about conversion, even if one or both streams happen to be standard input/output/error, not just some other binary file.

I agree that if the second script indeed is the simplest way, then that is probably too hard. But as clearly demonstrated, it's not that hard to read and write "text files".

[edit: to whit: "By default, these [stdin,out,err] streams are regular text streams as returned by the open() function. (...)

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

Using io.TextIOBase.detach(), streams can be made binary by default. This function sets stdin and stdout to binary:"

    def make_streams_binary():
        sys.stdin = sys.stdin.detach()
        sys.stdout = sys.stdout.detach()

https://docs.python.org/3/library/sys.html#sys.stdout]

e12e · on May 12, 2014

So, to make the script work similarly for python3:

    $ diff test.py test3.py 
    5c5
    <     f = sys.stdin
    ---
    >     f = sys.stdin.buffer
    13c13
    <         shutil.copyfileobj(f, sys.stdout)
    ---
    >         shutil.copyfileobj(f, sys.stdout.buffer)

We can now read in binary, and write to stdout in binary, and this script will happily garble your terminal if you point it at /bin/ls. If the lines above were unclear, here is the entire "python3" version. Note that we now explicitly treat standard input and output as binary (which is "wrong"):

    import sys
    import shutil

    for filename in sys.argv[1:]:
        f = sys.stdin.buffer
        if filename != '-':
            try:
                f = open(filename, 'rb')
            except IOError as err:
                print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
                continue
        with f:
            shutil.copyfileobj(f, sys.stdout.buffer)

And we can now, for example:

    $ head -c 200 /bin/ls | python3 test3.py -
    ELF>�H@@p�@8  @@@@@@�88@8@@ $

The crucial difference is, that if we wanted only to deal with text, we'd get a sane error (see above) from python3. I'd probably add a comment about the sys.stdin.buffer as it isn't exactly obvious that what we do is go from stdin that is a text stream to the underlying buffer that is binary -- but I can't really agree that this is super-hard -- it took me a few minutes to google "python3 binary io" and find this…

And all these scripts (appear to) deal fine with utf-8 file names both in python2 and 3.

[edit: Having redone this by hand, I see that this is sort of what the "new" script in the post does, with some caveats on how stdin/out is "sometimes" not not binary... I still have a hard time accepting that this short python3 script is much more fragile than the original short python2 script. The fact that you get an error if you try to copy a binary stream to a text stream seems sane to me...]

keyme · on May 12, 2014

Strings should be byte strings. Not ASCII, not Unicode. Bytes.

Strings don't represent Text lest I decide they do. For this a UnicodeString object should exist, and it should _not_ be the default.

In my latest project I've made myself use Python 3.4 over 2.7, for its new great features. So many steps forward, except this one thing.

What a stupid decision are these default Unicode strings...

andrewstuart · on May 12, 2014

Wouldn't people be complaining if "the unicode problem" hadn't been solved in Python rather than leaving it an undefined mess? Now it is a solved problem even if the solution is seen as a problem by some.

pekk · on May 12, 2014

From the one person who has complained most about this topic, making him an expert on complaining about Python 3 but not necessarily as much of an expert on how to cope.

skizm · on May 12, 2014

Bit off topic, but can anyone recommend a good tutorial/book/whatever for python 2 programmers looking to move to (or at least become familiar with) python 3?

maxerickson · on May 12, 2014

What's New in 3.0 has lots of information:

https://docs.python.org/3/whatsnew/3.0.html

and maybe take a look at what standard modules have moved to different name or namespace:

https://docs.python.org/3/py-modindex.html

daftshady · on May 12, 2014

Here's good porting guide http://python3porting.com/

bdevine · on May 12, 2014

I liked the latest edition of "Python Cookbook" by Beazley and Jones for precisely this.

im3w1l · on May 13, 2014

>For instance it will attempt decoding from utf-8 with replacing decoding errors with question marks.

Please don't do this. Replacing with question mark is a lossy transformation. If you use a lossless transformation, a knowledgeable user of your program will be able to reverse the garbling, in their head, or using a tool. Consider Ã¥Ã¤Ã¶, the result of interpreting utf8 åäö as latin1. You could find both the reason and solution by googling on it.

Retr0spectrum · on May 12, 2014

Did anyone else find the title font hard to read?

rectangletangle · on May 12, 2014

Yeah, real narrow fonts seem to have pretty poor readability.

jrochkind1 · on May 13, 2014

I have to admit I can't follow this completely -- dealing with file system file names that are not in ascii is a very confusing thing, and one I haven't done before -- plus I am not very familiar with python.

But I have done a lot of dealing with char encoding issues though -- in ruby.

In ruby 1.9+, I find ruby's char encoding handling to be quite good. Which does not mean it's not incredibly challenging and confusing to deal with char encoding issues. But it means I haven't been able to come up with any better approach than ruby 1.9+'s, anything I wish ruby 1.9+ did differently.

The mental model is simple (relatively, for the domain anyway) -- any strings are tagged with an encoding. If your string contains illegal bytes for the encoding it's tagged with, it's gonna raise if you try to concatenate it or do much anything else with it. Concatenating strings of two different encodings is probably going to raise too (some exceptions if they are both ascii supersets and happen to contain only ascii-valid 7-bit chars). You can easily check if a string contains any illegal bytes; change the tagged encoding to any encoding you like (including the 'binary' null encoding); remove bad bytes; or trans-code from one encoding to another.

It means that you have all the tools you need to deal with char encoding issues, but you still need to think through some complicated and confusing issues to deal with em. It is an inherently confusing domain (which is why it's nice that more and more of the time you can just assume UTF8 everywhere -- but yes, I've written plenty of code that can't assume that too, or that has to deal with bad bytes in presumed UTF8)

(The biggest frustrations can be when using gems (libraries) that themselves aren't dealing with char encoding correctly, and then you find yourself debugging someone elses code and trying to convince them that their code is incorrect when they're putting up a fight cause it's so damn confusing. There are still plenty of encoding related bugs. But I'm not sure that's ruby's fault).

You certainly can deal with everything as a byte stream (the 'binary' null encoding) if you want to in ruby, as far as the language is concerned, although I don't think you actually usually want to. (and some open source gems might not play well with that approach either)

It would be interesting to see someone who understands both ruby and python take the OP and analogize the problem case to ruby 1.9+ and see if it's any different.

(One important thing ruby was missing prior to 2.1 is the new String#scrub method. It was possible to write it yourself though, which I figured out eventually. Another thing I still wish ruby had built-in to stdlib was more of the Unicode algorithms (sort collation, case change, etc.), although there are gems for most of em these days, thanks open source.)