Re: [Tutor] A regular expression problem

Steven D'Aprano Wed, 01 Dec 2010 03:22:13 -0800

Josep M. Fontana wrote:
[...]

> I guess this is because the character encoding was not specified but
> accented characters in the languages I'm dealing with should be
> treated as a-z or A-Z, shouldn't they?

No. a-z means a-z. If you want the localized set of alphanumeric
characters, you need \w.


Likewise 0-9 means 0-9. If you want localized digits, you need \d.
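
To make the difference concrete, here is a minimal sketch. It assumes
Python 3, where str patterns use Unicode categories for \w and \d by
default (older versions needed the re.UNICODE or re.LOCALE flag). The
sample strings are made up for illustration.

```python
import re

# "El niño comió 3 piñas", written with escapes so the accented
# letters are unambiguous single code points.
text = "El ni\u00f1o comi\u00f3 3 pi\u00f1as"

# An explicit range matches only the literal characters it lists,
# so accented letters split the words.
ascii_words = re.findall(r"[a-zA-Z]+", text)

# \w matches any character categorized as alphanumeric (plus "_").
unicode_words = re.findall(r"\w+", text)

# Same idea for digits: "3" followed by ARABIC-INDIC DIGIT THREE.
digits = "3 \u0663"
ascii_digits = re.findall(r"[0-9]", digits)
unicode_digits = re.findall(r"\d", digits)

print(ascii_words)
print(unicode_words)
print(ascii_digits)
print(unicode_digits)
```

The range-based pattern breaks "niño" into "ni" and "o", while \w+
keeps it whole.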


> I mean, how do you deal with
> languages that are not English with regular expressions? I would
> assume that as long as you set the right encoding, Python will be able
> to determine which subset of specific sequences of bytes count as a-z
> or A-Z.


Encodings have nothing to do with this issue.

Literal characters a, b, ..., z etc. always have ONE meaning: they
represent themselves (although possibly in a case-insensitive fashion).
E means E, not È, É, Ê or Ë.

Localization tells the regex how to interpret special patterns like \d
and \w. This has nothing to do with encodings -- by the time the regex
sees the string, it is already dealing with characters. Localization is
about what characters are in categories ("is 5 a digit or a letter? how
about ٣ ?").
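
A rough illustration of both points, again assuming Python 3 str
patterns. The variable names are mine, and unicodedata is used here
only to show the category lookup; it is not what the re module calls
internally.

```python
import re
import unicodedata

E_ACUTE = "\u00c9"       # É
ARABIC_THREE = "\u0663"  # ٣

# A literal E never matches an accented variant...
literal_matches_accented = re.fullmatch("E", E_ACUTE) is not None
# ...although it can match "e" when case is ignored.
literal_matches_lower = re.fullmatch("E", "e", re.IGNORECASE) is not None

# Both characters belong to the Unicode category "Nd" (decimal digit).
category_of_5 = unicodedata.category("5")
category_of_arabic = unicodedata.category(ARABIC_THREE)

# \d follows the category by default; re.ASCII restricts it to 0-9.
unicode_digit = re.fullmatch(r"\d", ARABIC_THREE) is not None
ascii_only_digit = re.fullmatch(r"\d", ARABIC_THREE, re.ASCII) is not None

print(literal_matches_accented, literal_matches_lower)
print(category_of_5, category_of_arabic)
print(unicode_digit, ascii_only_digit)
```

No encoding appears anywhere in this code: every string is already a
sequence of characters.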

Encoding is used to translate between bytes on disk and characters. For
example, the character Ë could be stored on disk as the hex bytes:


\xcb              # one byte
\xc3\x8b          # two bytes
\xff\xfe\xcb\x00  # four bytes

and more, depending on the encoding used.
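
The three byte sequences above can be reproduced as follows. This is a
sketch assuming they correspond to Latin-1, UTF-8, and little-endian
UTF-16 with a byte order mark; the post itself does not name the
encodings.

```python
import codecs

ch = "\u00cb"  # Ë

latin1 = ch.encode("latin-1")  # one byte
utf8 = ch.encode("utf-8")      # two bytes
# The plain "utf-16" codec uses the platform's byte order, so the BOM
# and the little-endian payload are spelled out explicitly here.
utf16 = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")  # four bytes

for raw in (latin1, utf8, utf16):
    print(raw)

# Decoding each sequence with the matching codec gives back the same
# single character.
decoded = [
    latin1.decode("latin-1"),
    utf8.decode("utf-8"),
    utf16.decode("utf-16"),
]
print(decoded)
```

Once decoded, all three are the same one-character string, which is
all the regex ever sees.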


--
Steven
