On Thu, Jan 28, 2016 at 2:23 PM, Albert-Jan Roskam <sjeik_ap...@hotmail.com> wrote:
> Out of curiosity, I wrote the throw-away script below to find a character
> that is classified (--> LC_CTYPE) as digit in one locale, but not in another.

The re module is the wrong tool for this. The re.LOCALE flag is only for byte strings, and even then \d matches only ASCII 0-9 as decimal digits. It doesn't call the isdigit() ctype function.

Using Unicode with re.LOCALE is wrong. The current locale doesn't affect the meaning of a Unicode character. Starting with Python 3.6, doing this raises an exception (ValueError).

The POSIX ctype functions such as isalnum and isdigit accept only a single character code in the range 0-255, or EOF (-1). For UTF-8, the ctype functions return 0 in the range 128-255 (i.e. lead bytes and trailing bytes aren't characters). Even if this range has valid characters in a given locale, it's meaningless to use a Unicode value from the Latin-1 block, unless the locale uses Latin-1 as its codeset.

Python 2's str.isdigit() uses the locale-aware C isdigit() function. However, all of the locales on my Linux system use UTF-8, so I have to switch to Windows to demonstrate two locales that differ with respect to isdigit(). You could use PyWin32 or ctypes to iterate over all the locales known to Windows, if it mattered that much to you.

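As an aside, the re claims above can be checked on any platform. The following is only a minimal sketch, assuming Python 3.6 or later; the variable names are illustrative and not part of the original script.

```python
import re

# Byte string holding ASCII digits plus the bytes 0xB2, 0xB3 and 0xB9.
data = b'0123456789\xb2\xb3\xb9'

# With a bytes pattern, \d matches only ASCII 0-9, with or without re.LOCALE.
ascii_digits = re.findall(rb'\d', data)
locale_digits = re.findall(rb'\d', data, re.LOCALE)
print(ascii_digits == locale_digits)

# With a str pattern, re.LOCALE is rejected outright on Python 3.6+.
try:
    re.compile(r'\d', re.LOCALE)
    str_locale_error = None
except ValueError as exc:
    str_locale_error = str(exc)
print(str_locale_error)
```

The bytes 0xB2, 0xB3 and 0xB9 are never matched, whatever LC_CTYPE is set to, because the regex engine does not consult isdigit() for \d.
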
The English locale (codepage 1252) includes superscript digits 1, 2, and 3:

    >>> import locale, re, unicodedata
    >>> locale.setlocale(locale.LC_CTYPE, 'English_United Kingdom')
    'English_United Kingdom.1252'
    >>> [chr(x) for x in range(256) if chr(x).isdigit()]
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3', '\xb9']
    >>> unicodedata.name('\xb9'.decode('1252'))
    'SUPERSCRIPT ONE'
    >>> unicodedata.name('\xb2'.decode('1252'))
    'SUPERSCRIPT TWO'
    >>> unicodedata.name('\xb3'.decode('1252'))
    'SUPERSCRIPT THREE'

Note that using the re.LOCALE flag doesn't match these superscript digits:

    >>> re.findall(r'\d', '0123456789\xb2\xb3\xb9', re.LOCALE)
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

The Windows Greek locale (codepage 1253) substitutes "Ή" for superscript 1:

    >>> locale.setlocale(locale.LC_CTYPE, 'Greek_Greece')
    'Greek_Greece.1253'
    >>> [chr(x) for x in range(256) if chr(x).isdigit()]
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3']
    >>> unicodedata.name('\xb9'.decode('1253'))
    'GREEK CAPITAL LETTER ETA WITH TONOS'
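
For anyone without a Windows box, a rough approximation of the same comparison can be made in Python 3 by decoding each byte with the two codepages and asking Unicode whether the result is a digit. This is a sketch under stated assumptions: it relies on str.isdigit() and Python's cp1252/cp1253 codecs rather than the C runtime's isdigit(), so it need not agree with a real locale in every case, and digit_bytes is a hypothetical helper name.

```python
import unicodedata

def digit_bytes(codec):
    """Return the byte values whose decoded character is a Unicode digit."""
    found = set()
    for value in range(256):
        # Byte values the codec leaves undefined decode to '' and are skipped.
        char = bytes([value]).decode(codec, errors='ignore')
        if char and char.isdigit():
            found.add(value)
    return found

english = digit_bytes('cp1252')
greek = digit_bytes('cp1253')

# Byte values that count as digits under cp1252 but not under cp1253.
only_english = sorted(english - greek)

for value in only_english:
    char_1252 = bytes([value]).decode('cp1252')
    char_1253 = bytes([value]).decode('cp1253')
    print(hex(value), unicodedata.name(char_1252), '|', unicodedata.name(char_1253))
```

This makes the same point as the sessions above: the byte is identical, and only the codeset used to interpret it decides whether it is a digit.
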