Josep M. Fontana wrote:
I am working with texts that are encoded as ISO-8859-1. I have
included the following two lines at the beginning of my python script:

#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-

If I'm not mistaken, this should tell Python that accented characters
such as 'á', 'Á', 'ö' or 'è' should be considered as alpha-numeric
characters and therefore matched with a regular expression of the form
[a-zA-Z].

You are mistaken. a-zA-Z always means the ASCII A to Z, and nothing else.

You are conflating three unrelated problems:

(1) What encoding is used to decode the bytes of the source file on disk into the characters of its string literals?

(2) What encoding is used for the data fed to the regex engine?

(3) What characters does the regex engine consider to be alphanumeric?


The encoding line only tells Python what encoding to use to read the source code. It has no effect on text read from files, or byte strings, or anything else. It is only to allow literals and identifiers to be decoded correctly, and has nothing to do with regular expressions.
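As a rough sketch of problem (2), assuming the data arrives as ISO-8859-1 bytes (the sample bytes and variable names here are made up for illustration), the decoding has to be done explicitly, whatever the coding line of the script says:

```python
import re

# Hypothetical sample: bytes as they would be read from an ISO-8859-1 file.
# 0xF6 is the single-byte ISO-8859-1 encoding of the character ö.
raw = b"...ab\xf6yz..."

# The coding declaration of the script plays no part here; the data
# is decoded explicitly with the encoding it was written in.
text = raw.decode("iso-8859-1")

# [a-zA-Z] still only covers the ASCII letters, so the match stops at ö.
ascii_run = re.search(u"[a-zA-Z]+", text).group(0)
```

After decoding, text is a Unicode string, which is what the regex examples further down operate on.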

To match accented characters, you can do one of two things:

(1) explicitly include the accented characters you care about in
    the regular expression;

or

(2) i.   set the current LOCALE to a locale that includes the
         characters you care about;
    ii.  search for the \w regex special sequence; and
    iii. include the (?L) flag in the regex (or pass re.LOCALE).


In both cases, don't forget to use Unicode strings, not byte strings.

For example:


>>> import re
>>> text = u"...aböyz..."
>>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'


Setting the locale on its own isn't enough:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_DE')
'de_DE'
>>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'


Nor is using the \w alphanumeric sequence on its own, since without the locale flag the regex engine ignores the current locale and treats only the ASCII letters, digits and underscore as alphanumeric:

>>> re.search(r'\w+', text).group(0)
u'ab'


But if you instruct the engine to use the current locale instead, then it works:

>>> re.search(r'(?L)\w+', text).group(0)
u'ab\xf6yz'


(Don't be put off by the ugly printing representation of the unicode string. \xf6 is just how repr() displays the character ö.)


Oh, and just to prove my point that a-z is always ASCII, even with the locale set:

>>> re.search(r'(?L)[a-zA-Z]+', text).group(0)
u'ab'
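For comparison, option (1) from the list above, naming the accented characters explicitly, might look something like this. It is only a sketch: the character class covers just the four accented letters mentioned in the question, and the names pattern and word are arbitrary.

```python
import re

text = u"...ab\xf6yz..."  # \xf6 is ö

# ASCII letters plus the accented letters from the question:
# \xe1 = á, \xc1 = Á, \xf6 = ö, \xe8 = è
pattern = re.compile(u"[a-zA-Z\xe1\xc1\xf6\xe8]+")

word = pattern.search(text).group(0)
```

This needs no locale at all, but any accented letter left out of the class will still end the match.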


Note also that \w means alphanumeric, not just alpha, so it will also match digits and the underscore.
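If digits and underscores are unwanted, one common idiom is the character class [^\W\d_], which keeps whatever \w matches minus digits and the underscore. A minimal sketch, with one substitution to be aware of: it uses the re.UNICODE flag rather than the locale flag shown above, so that it does not depend on a particular locale being installed.

```python
import re

text = u"...ab\xf6yz_42..."  # \xf6 is ö

# \w with re.UNICODE: letters (including ö), digits and underscore.
word_run = re.search(r"\w+", text, re.UNICODE).group(0)

# [^\W\d_]: "not a non-word character, not a digit, not an underscore",
# which leaves only the letters.
letters_only = re.search(r"[^\W\d_]+", text, re.UNICODE).group(0)
```

Here word_run runs through to the digits, while letters_only stops at the underscore.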




--
Steven

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
