>> East Asian languages. But later on Joel Spolsky's "standard" page about unicode
>> I read that it goes to 6 bytes. That's what I implied when I mentioned "utf8".
>
> Each surrogate in a UTF-16 surrogate pair is 10 bits, for a total of
> 20 bits. Thus UTF-16 sets the upper bound on the number of code points
> at 2**20 + 2**16 (BMP). UTF-8 only needs 4 bytes for this number of
> codes.
>
>> A certain locale implies a certain codepage (on Windows), but where does the locale
>> category LC_CTYPE fit in this story?
>
> LC_CTYPE is the locale category that classifies characters. In Debian
> Linux, the English-language locales copy LC_CTYPE from the i18n
> (internationalization) locale:

Thanks for the links. Without examples it remains pretty abstract, but I
think I know what is meant by this locale category now: "The LC_CTYPE
category shall define character classification, case conversion, and other
character attributes." So if you switch from one locale to another, certain
attributes of a character set might change. A switch from locale A to
locale B might affect the attribute "casing"; therefore, the mapping from
lower- to uppercase *might* differ by locale. In stupid country X
"a".upper() may return "B".
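To make the casing point less abstract, here is a minimal sketch, assuming Python 3 (where, as far as I can tell, str.upper() follows the Unicode default mapping and does not look at LC_CTYPE). Turkish is a real-world "country X": there, "i" uppercases to a dotted capital I (U+0130). The name `turkish_upper` is hypothetical, made up for this illustration; it is not a stdlib function.

```python
# Assumption: Python 3, where str.upper() is locale-independent.

def turkish_upper(text):
    """Hypothetical helper: uppercase using the Turkish dotted/dotless i rule."""
    # Handle the two Turkish-specific letters first, then apply the default mapping.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

default_result = "iz".upper()          # default Unicode mapping
turkish_result = turkish_upper("iz")   # Turkish convention

# ascii() keeps the output printable on any terminal encoding.
print(ascii(default_result))
print(ascii(turkish_result))
```

So the same lowercase letter can have two different "correct" uppercase forms, depending on whose conventions apply.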
It seems that the result of str.isalpha() and str.isdigit() *might* be
different depending on the setting of locale.LC_CTYPE. It is pretty sick
that all these things can be adjusted separately (what is the use of
having: Danish collation, Russian case conversion, English decimal sign,
Japanese codepage ;-)

> The i18n locale is defined by the ISO/IEC technical report 14652, as
> an instance of an upward compatible extension to the POSIX locale
> specification called the FDCC-set (i.e. Set of Formal Definitions of
> Cultural Conventions). Here it is in all its glory, if you like
> reading technical reports:
>
> http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf
>
> If that's not enough, here's the POSIX 1003.1 locale spec:
>
> short: http://goo.gl/aOJUx
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

That one is the clearest IMHO. Oh no, now I see the possible impact on
regexes. The meaning of e.g. "\s+" might change depending on the
locale.LC_CTYPE setting!!

>> Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?
>
> Narrow builds create UTF-16 surrogate pairs from \U literals, but
> these aren't treated as an atomic unit for slicing, iteration, or
> string length.

That is a nice way of putting it. So if you slice a multibyte char "mb",
mb[0] will return the first byte? That is annoying.
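The numbers from the quoted reply can be checked with a few lines. This is only a sketch, assuming Python 3.3 or later (which has no narrow builds any more, so the UTF-16 code units are obtained by encoding explicitly). `utf16_units` is a hypothetical helper name, not something from the standard library.

```python
import struct

def utf16_units(text):
    """Hypothetical helper: return the UTF-16 code units of text as ints."""
    raw = text.encode("utf-16-le")
    # Each code unit is one unsigned 16-bit little-endian value.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# 2**20 supplementary code points plus the 2**16 in the BMP.
total_code_points = 2**20 + 2**16

ch = "\U0001F600"                    # a code point outside the BMP
units = utf16_units(ch)              # high surrogate, then low surrogate
utf8_len = len(ch.encode("utf-8"))   # number of bytes in UTF-8

print(hex(total_code_points))
print([hex(u) for u in units])
print(utf8_len)
```

If I read the reply correctly, on a narrow build mb[0] would be the first 16-bit code unit (the high surrogate) rather than a single byte, and len(mb) would be 2.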