The main trick to remember is that to get UTF-8, you have to encode, not decode. When I didn't know how Unicode works that well, I mistakenly assumed Unicode text was coded, so I had to decode it. Internally, Unicode text is of course coded (that is, represented by some encoding), but you don't have to know which one. As the slides say, just encode to UTF-8 when you print text out, display it in a web browser, etc., and you will be fine.
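A minimal illustration of the direction of the two operations, in Python 3:

```python
# Text (str) -> bytes: encode.  Bytes -> text: decode.
text = "naïve café"
data = text.encode("utf-8")   # bytes, ready for files, sockets, browsers
assert isinstance(data, bytes)

back = data.decode("utf-8")   # you only decode what is already bytes
assert back == text
```

Trying it the other way around (decoding a str, or encoding bytes) is exactly the mistake described above, and Python 3 refuses it outright.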
One other Unicode library I can recommend is PyICU, Python bindings to IBM's ICU. It solved some Turkish-specific problems I had. (The Turkish alphabet has I, İ, ı, and i, which makes upper/lower case conversion tricky. For example, mayıs becomes MAYIS when uppercased, while ENGLISH becomes englısh when lowercased.)
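The problem is easy to reproduce with plain Python, whose default case conversion is locale-independent; the PyICU call at the end is a sketch of the locale-aware fix, assuming PyICU is installed:

```python
# Python's built-in case conversion ignores locale, which is wrong for Turkish:
print("mayıs".upper())    # MAYIS  (dotless ı -> I, correct here by coincidence)
print("MAYIS".lower())    # mayis  (I -> dotted i; Turkish expects "mayıs")
print("ENGLISH".lower())  # english (Turkish rules would give "englısh")

# With PyICU (API sketch -- check against your installed version):
# from icu import UnicodeString, Locale
# str(UnicodeString("MAYIS").toLower(Locale("tr")))  # locale-aware lowercasing
```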
Another issue, if you are doing web dev, is how to represent non-ASCII characters in URLs. One choice is URL encoding; the other is slugifying the URL, that is, choosing an ASCII equivalent for each non-ASCII character. Django, for example, has a slugify function that helps with this: it converts über to uber.
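A rough sketch of the idea behind such a slugifier (not Django's actual implementation): decompose accented characters, drop everything non-ASCII, then normalize whitespace to hyphens:

```python
import re
import unicodedata

def slugify(value):
    # NFKD splits "ü" into "u" + combining diaeresis; encoding to ASCII with
    # errors="ignore" then drops the combining mark and any other non-ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Keep only word characters, spaces, and hyphens; collapse runs to "-".
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)

print(slugify("über cool"))   # uber-cool
print(slugify("ぬびばざべ"))  # empty string: kana have no ASCII decomposition
```

The second call also shows why slugifying fails completely for scripts with no ASCII equivalents, as noted below.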
Yes, the Django slugifier turns ぬびばざべ into empty text. Maybe that's because my language is set to tr-tr; I am not sure. One possible downside with ICU is that, AFAIK, you can't run it on Google App Engine.
Dive Into Python 3 does a very good job of explaining what a character is, what a byte is, the difference between them, and why it matters (particularly for Python 3). http://diveintopython3.org/strings.html
The main problem with Python encoding is that cmd.exe on Windows does not support certain >127 ASCII characters, so the "ordinal not in range(128)" error is a pretty common Python encoding error. But Python can handle >127 ASCII fine internally. The real devil in the Python encoding myth is the `print` function.
BTW using `mbcs` encoding on Windows is more compatible than ASCII in most cases.
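The error mentioned above can be reproduced directly, without a Windows console, by encoding to ASCII; the console failure is the same thing happening implicitly inside `print`:

```python
# The classic failure: a character the target encoding can't represent.
try:
    "über".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode ... ordinal not in range(128)
```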
There are no ASCII characters > 127. ASCII is a 7-bit encoding. That's why you run into cp1252 all the time if you work with Western European languages, such as English. cp1252 is an 8-bit superset of ASCII designed for Western European languages. I believe it's true that the first 256 unicode code points are actually cp1252, which implies that the first 128 code points correspond with ASCII.
Windows-1252 (cp1252) is similar to ISO-8859-1 (Latin-1), which is a subset of Unicode; both are a superset of ASCII. Windows-1252 however has a few characters mapped to different codepoints.
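The difference is visible in the 0x80–0x9F range, which ISO-8859-1 maps to C1 control codes but Windows-1252 maps to printable characters such as curly quotes:

```python
# 0x93/0x94 are curly quotes in cp1252, but control characters in Latin-1.
data = b"\x93quoted\x94"
print(data.decode("cp1252"))   # “quoted” -- U+201C and U+201D
print(data.decode("latin-1"))  # same bytes become invisible C1 controls
```

This is exactly the handful of codepoints the comment above refers to.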
The Windows console does support encodings: see the `chcp` command, which changes the code page. The cmd.exe builtins can support UTF-16 regardless of the code page.
When I last looked at this, Python didn't attempt to deal with the Windows console code page.
CPython doesn't actually use Unicode strings in the sense of a sequence of codepoints. Instead, you get either UCS-2 or UCS-4 strings, depending on compile-time options.
For example, freshly installed python3.2 from macports:
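One way to check which build you have (a sketch; note that Python 3.3+ switched to PEP 393 flexible storage, so modern interpreters always behave like wide builds):

```python
import sys

# 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build.
print(hex(sys.maxunicode))

# On a narrow build, a non-BMP character is stored as a surrogate pair,
# so len() reports 2; wide builds and Python 3.3+ report 1.
print(len("\U0001F600"))
```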
So, that sounds fairly similar to "Unicode in $arbitrary_language": Unicode is supported internally; there is still complexity associated with decoding/encoding data on the way in and out; new versions of the language are working to reduce that complexity but aren't quite there yet; and handling of Unicode by libraries tends to be inconsistent and probably undocumented.
Someone should do a comparative analysis across languages. My guess is that (in-browser) javascript is probably one of the few languages to not have significant unicode problems, although there are surely some.
What baffles me is that we are trying to convince programmers to support Unicode while many programs and websites won't even accept special symbols. In passwords.
I mean, how dumb does the system have to be to reject the @ char in the password field? Or worse yet, accept only _numbers_, as my university does.
I think we did something terribly wrong somewhere in the char set conventions.
Appears to be mostly good information, though I'm surprised he paints UTF-16 in such a positive light ("optimized for languages residing in the 2 byte character range."). UTF-16 should be considered harmful:
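The core objection is that the "2 bytes per character" property only holds inside the Basic Multilingual Plane; anything above U+FFFF becomes a 4-byte surrogate pair, so UTF-16 is a variable-width encoding that merely looks fixed-width:

```python
# UTF-16 is only 2 bytes per character inside the BMP.
bmp = "A"              # U+0041, inside the BMP
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(len(bmp.encode("utf-16-le")))     # 2
print(len(astral.encode("utf-16-le")))  # 4 -- a surrogate pair
```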
Read http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I , http://www.joelonsoftware.com/articles/Unicode.html and http://www.codinghorror.com/blog/2008/03/whats-wrong-with-tu... for more info.