Re: [Tutor] Problems processing accented characters in ISO-8859-1 encoded texts

Steven D'Aprano Thu, 23 Dec 2010 02:53:58 -0800

Josep M. Fontana wrote:

I am working with texts that are encoded as ISO-8859-1. I have
included the following two lines at the beginning of my python script:


#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-

If I'm not mistaken, this should tell Python that accented characters
such as 'á', 'Á', 'ö' or 'è' should be considered as alpha-numeric
characters and therefore matched with a regular expression of the form
[a-zA-Z].


You are mistaken. a-zA-Z always means the ASCII A to Z, and nothing else.

You are conflating three unrelated problems:

(1) What encoding is used to convert the bytes on disk of the source code literals into characters?


(2) What encoding is used for the data fed to the regex engine?

(3) What characters does the regex engine consider to be alphanumeric?

The encoding line only tells Python what encoding to use to read the source code. It has no effect on text read from files, or byte strings, or anything else. It is only to allow literals and identifiers to be decoded correctly, and has nothing to do with regular expressions.
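
To make that concrete, here is a minimal sketch. The byte string is built by hand and merely stands in for data read from an ISO-8859-1 file; the names raw and text are only illustrative. The coding line plays no part here, so the data has to be decoded explicitly:

```python
# Bytes as they might arrive from an ISO-8859-1 encoded file
# (built by hand so the example is self-contained).
raw = b"ab\xf6yz"

# The coding declaration does not touch this data;
# it must be decoded explicitly into a Unicode string.
text = raw.decode("iso-8859-1")
```

After this, text is a Unicode string of five characters, the third being ö.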


To match accented characters, you can do two things:

(1) explicitly include the accented characters you care about in
    the regular expression;

or

(2) i.   set the current LOCALE to a locale that includes the
         characters you care about;
    ii.  search for the \w regex special sequence; and
    iii. include the (?L) flag in the regex.


In both cases, don't forget to use Unicode strings, not byte strings.

For example:


>>> import re
>>> text = u"...aböyz..."
>>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'


Setting the locale on its own isn't enough:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_DE')
'de_DE'
>>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'

Nor is using the locale-aware alphanumeric sequence, since the regex engine is still using the default C locale:


>>> re.search(r'\w+', text).group(0)
u'ab'

But if you instruct the engine to use the current locale instead, then it works:


>>> re.search(r'(?L)\w+', text).group(0)
u'ab\xf6yz'

(Don't be put off by the ugly printing representation of the unicode string. \xf6 is just the repr() of the character ö.)
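
If you want to convince yourself that \xf6 really is ö, the standard unicodedata module can name the character. A small sketch (the variable names are just for illustration):

```python
import unicodedata

ch = u"\xf6"

# Look up the official Unicode name and the code point of the character.
name = unicodedata.name(ch)
codepoint = ord(ch)
```

Here name comes out as LATIN SMALL LETTER O WITH DIAERESIS and codepoint as 246 (0xF6).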

Oh, and just to prove my point that a-z is always ASCII, even with the locale set:


>>> re.search(r'(?L)[a-zA-Z]+', text).group(0)
u'ab'
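
By contrast, option (1) needs no locale at all: you simply extend the character class yourself. A rough sketch, where the set of accented characters is only the handful mentioned in the question (adjust it to your texts), written with escapes so the source encoding doesn't matter:

```python
import re

text = u"...ab\xf6yz..."

# ASCII letters plus the accented characters we care about:
# \xe1 = á, \xc1 = Á, \xf6 = ö, \xe8 = è
pattern = re.compile(u"[a-zA-Z\xe1\xc1\xf6\xe8]+")

word = pattern.search(text).group(0)
```

This time word holds the whole of 'aböyz', because ö is explicitly in the class.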

Note also that \w means alphanumeric, not just alpha, so it will also match digits.
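
If digits are a problem, one common idiom is to subtract them (and the underscore) from \w with a negated class. Note that this sketch uses the re.UNICODE flag rather than the locale approach shown above, which is my substitution to keep the example independent of the installed locales; in Python 3 that flag is already the default for text patterns. The sample string is made up:

```python
import re

text = u"ab\xf6yz123_x"

# \w with re.UNICODE: letters, digits and the underscore
words = re.findall(u"\\w+", text, re.UNICODE)

# [^\W\d_]: word characters minus digits and the underscore,
# which approximates "letters only"
letters = re.findall(u"[^\\W\\d_]+", text, re.UNICODE)
```

With this input, words is the whole string as a single match, while letters contains just 'aböyz' and 'x'.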





--
Steven

