
The length of an array should correspond to the number of elements. Since each element is a code point, it's the most relevant number if you intend to operate on individual elements. That is, the maximum index corresponds to the length of the array.

If you care about the number of bytes, or want to operate on individual bytes, then encode to UTF-8, UTF-16, or UTF-32 and operate on the resulting bytes object. If you wish to operate on grapheme clusters, then you could probably find some third-party Python library that lets you represent and operate on strings in terms of grapheme clusters.
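
To make the distinction concrete, here is a rough Python sketch of the counts described above. The sample string is my own choice, not something from the thread, and the little-endian codec names are used only so that no byte-order mark is added to the byte counts.

```python
# Sample text chosen for illustration: "t", "h", and a precomposed U+00E1.
s = "th\u00e1"

# len() on a Python str counts code points.
code_points = len(s)

# Encoding gives a bytes object; len() on that counts bytes.
utf8_bytes = len(s.encode("utf-8"))
utf16_bytes = len(s.encode("utf-16-le"))  # "-le" avoids a leading BOM
utf32_bytes = len(s.encode("utf-32-le"))

print(code_points, utf8_bytes, utf16_bytes, utf32_bytes)  # prints: 3 4 6 12
```

Same text, four different "lengths", depending on which unit you decide to count.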



For the vast majority of uses, a string is not an array; it is a chunk of text. Exactly how that chunk of text is represented in memory and what API it should expose is the discussion we're having. My point is that it shouldn't be exposed as an array of codepoints, since array operations (lengths, indexing, taking a range) are not a very useful way of manipulating text; and even if we did expose strings as an array, Unicode code points are definitely not a useful element type for almost any purpose.
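
As a small illustration of why code point indexing and ranges can mislead, here is a Python sketch. The sample string is an assumption of mine: "tha" followed by U+0301 COMBINING ACUTE ACCENT, which renders as a single accented letter at the end.

```python
# Illustrative string: "t", "h", "a", then a combining acute accent.
# A reader sees three characters; Python sees four code points.
decomposed = "tha\u0301"

# Taking a "range" by code point silently drops the accent.
prefix = decomposed[:3]

# Indexing the last element yields a bare combining mark, not a letter.
last = decomposed[-1]

print(len(decomposed))   # prints: 4
print(prefix == "tha")   # prints: True
print(last == "\u0301")  # prints: True
```

The slice is perfectly valid as an array operation, but the result is no longer the same text a user would say they selected.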

There are basically only two things that can be done with a Unicode codepoint: encode it in bytes for storage, or transform it to a glyph in a particular font or culture.

You can't even compare two sequences of Unicode codepoints for equality in many cases, since there are different ways to represent the same text with Unicode. For example the strings "thá" and "thá" are different in terms of codepoints, but most people would expect to find the second when typing in the first. Even worse, there are codepoints which are supposed to represent different characters, depending on the font being used / the locale of the display (the same Unicode codepoints are used to represent related Chinese, Japanese, or Korean characters, even when these characters are not identical between the three cultures).
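
A minimal Python sketch of the equality problem, assuming the two example strings differ in that one uses the precomposed U+00E1 and the other uses "a" plus U+0301 COMBINING ACUTE ACCENT. The helper name `same_text` is made up for illustration; `unicodedata.normalize` is the standard library call.

```python
import unicodedata

composed = "th\u00e1"     # precomposed U+00E1
decomposed = "tha\u0301"  # "a" followed by U+0301 COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)            # prints: False
print(len(composed), len(decomposed))    # prints: 3 4
print(same_text(composed, decomposed))   # prints: True
```

Normalization only handles canonically equivalent sequences like this one. It does nothing for the CJK case, where the same code points are meant to be drawn differently depending on font or locale.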



