The main trick to remember is that to get UTF-8, you have to encode, not decode. When I didn't know how Unicode works that well, I mistakenly assumed Unicode text was coded, so I had to decode it. Internally, Unicode text is of course coded (that is, represented by some encoding), but you don't have to know which one. As the slides say, just encode to UTF-8 when you print text out, display it in a web browser, etc., and you will be fine.
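A minimal illustration of the direction of the two operations, in Python 3:

```python
# Text (str) -> bytes: encode.  Bytes -> text: decode.
text = "naïve café"
data = text.encode("utf-8")   # bytes, ready for files, sockets, browsers
assert isinstance(data, bytes)

back = data.decode("utf-8")   # you only decode what is already bytes
assert back == text
```

Trying it the other way around (decoding a str, or encoding bytes) is exactly the mistake described above, and Python 3 refuses it outright.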
One other Unicode library I can recommend is PyICU, Python bindings to IBM's ICU. It solved some Turkish-specific problems I had. (The Turkish alphabet has I, İ, ı, and i, which makes upper/lower case conversion tricky. For example, mayıs becomes MAYIS when uppercased, while ENGLISH becomes englısh when lowercased.)
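The problem is easy to reproduce with plain Python, whose default case conversion is locale-independent; the PyICU call at the end is a sketch of the locale-aware fix, assuming PyICU is installed:

```python
# Python's built-in case conversion ignores locale, which is wrong for Turkish:
print("mayıs".upper())    # MAYIS  (dotless ı -> I, correct here by coincidence)
print("MAYIS".lower())    # mayis  (I -> dotted i; Turkish expects "mayıs")
print("ENGLISH".lower())  # english (Turkish rules would give "englısh")

# With PyICU (API sketch -- check against your installed version):
# from icu import UnicodeString, Locale
# str(UnicodeString("MAYIS").toLower(Locale("tr")))  # locale-aware lowercasing
```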
Another issue, if you are doing web dev, is how to represent non-ASCII characters in URLs. One choice is URL encoding; the other is slugifying the URL, that is, choosing an ASCII equivalent for each non-ASCII character. Django, for example, has a slugify function that helps with this: it converts über to uber.
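A rough sketch of the idea behind such a slugifier (not Django's actual implementation): decompose accented characters, drop everything non-ASCII, then normalize whitespace to hyphens:

```python
import re
import unicodedata

def slugify(value):
    # NFKD splits "ü" into "u" + combining diaeresis; encoding to ASCII with
    # errors="ignore" then drops the combining mark and any other non-ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Keep only word characters, spaces, and hyphens; collapse runs to "-".
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)

print(slugify("über cool"))   # uber-cool
print(slugify("ぬびばざべ"))  # empty string: kana have no ASCII decomposition
```

The second call also shows why slugifying fails completely for scripts with no ASCII equivalents, as noted below.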
Yes, the Django slugifier turns ぬびばざべ into empty text. Maybe that's because my language is set to tr-tr; I am not sure. One possible downside with ICU is that, AFAIK, you can't run it on Google App Engine.
Dive Into Python 3 does a very good job of explaining what a character is, what a byte is, the difference between them, and why it matters (particularly for Python 3). http://diveintopython3.org/strings.html
The main problem with Python encoding is that cmd.exe on Windows does not support certain >127 ASCII characters, so the "ordinal not in range(128)" error is a pretty common Python encoding error. But Python can handle >127 ASCII fine internally. The real devil in the Python encoding myth is the `print` function.
BTW using `mbcs` encoding on Windows is more compatible than ASCII in most cases.
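The error mentioned above can be reproduced directly, without a Windows console, by encoding to ASCII; the console failure is the same thing happening implicitly inside `print`:

```python
# The classic failure: a character the target encoding can't represent.
try:
    "über".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode ... ordinal not in range(128)
```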
There are no ASCII characters > 127. ASCII is a 7-bit encoding. That's why you run into cp1252 all the time if you work with Western European languages, such as English. cp1252 is an 8-bit superset of ASCII designed for Western European languages. I believe it's true that the first 256 unicode code points are actually cp1252, which implies that the first 128 code points correspond with ASCII.
Windows-1252 (cp1252) is similar to ISO-8859-1 (Latin-1), which is a subset of Unicode; both are a superset of ASCII. Windows-1252 however has a few characters mapped to different codepoints.
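The difference is visible in the 0x80–0x9F range, which ISO-8859-1 maps to C1 control codes but Windows-1252 maps to printable characters such as curly quotes:

```python
# 0x93/0x94 are curly quotes in cp1252, but control characters in Latin-1.
data = b"\x93quoted\x94"
print(data.decode("cp1252"))   # “quoted” -- U+201C and U+201D
print(data.decode("latin-1"))  # same bytes become invisible C1 controls
```

This is exactly the handful of codepoints the comment above refers to.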
The Windows console does support encodings: see the `chcp` command, which changes the code page. The cmd.exe builtins can support UTF-16 regardless of the code page.
When I last looked at this, Python didn't attempt to deal with the Windows console code page.
CPython doesn't actually use Unicode strings in the sense of a sequence of codepoints. Instead, you get either UCS-2 or UCS-4 strings, depending on compile-time options.
For example, freshly installed python3.2 from macports:
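One way to check which build you have (a sketch; note that Python 3.3+ switched to PEP 393 flexible storage, so modern interpreters always behave like wide builds):

```python
import sys

# 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build.
print(hex(sys.maxunicode))

# On a narrow build, a non-BMP character is stored as a surrogate pair,
# so len() reports 2; wide builds and Python 3.3+ report 1.
print(len("\U0001F600"))
```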
So, that sounds fairly similar to "Unicode in $arbitrary_language": Unicode is supported internally; there is still complexity associated with decoding/encoding data on the way in and out; new versions of the language are working to reduce that complexity but aren't quite there yet; and handling of Unicode by libraries tends to be inconsistent and probably undocumented.
Someone should do a comparative analysis across languages. My guess is that (in-browser) javascript is probably one of the few languages to not have significant unicode problems, although there are surely some.
What baffles me is that we are trying to convince programmers to support Unicode while many programs and websites won't even accept special symbols. In passwords.
I mean, how dumb does the system have to be to reject the @ char in the password field? Or worse yet, accept only _numbers_, as my university does.
I think we did something terribly wrong somewhere in the char set conventions.
Appears to be mostly good information, though I'm surprised he paints UTF-16 in such a positive light ("optimized for languages residing in the 2 byte character range."). UTF-16 should be considered harmful:
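The core objection is that the "2 bytes per character" property only holds inside the Basic Multilingual Plane; anything above U+FFFF becomes a 4-byte surrogate pair, so UTF-16 is a variable-width encoding that merely looks fixed-width:

```python
# UTF-16 is only 2 bytes per character inside the BMP.
bmp = "A"              # U+0041, inside the BMP
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(len(bmp.encode("utf-16-le")))     # 2
print(len(astral.encode("utf-16-le")))  # 4 -- a surrogate pair
```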
Read http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I , http://www.joelonsoftware.com/articles/Unicode.html and http://www.codinghorror.com/blog/2008/03/whats-wrong-with-tu... for more info.