Josep M. Fontana wrote:
[...]
> I guess this is because the character encoding was not specified but
> accented characters in the languages I'm dealing with should be
> treated as a-z or A-Z, shouldn't they?

No. a-z means a-z. If you want the localized set of alphanumeric characters, you need \w (which also matches the underscore).

Likewise 0-9 means 0-9. If you want localized digits, you need \d.
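
A minimal sketch of the difference, assuming Python 3, where str patterns are Unicode-aware by default (in Python 2 you would have to pass re.UNICODE or re.LOCALE for \w to behave like this). The sample word is made up for illustration:

```python
import re

# Hypothetical sample text: "Éléphant", written with escapes so the
# accented letters are unambiguous precomposed characters.
word = "\u00c9l\u00e9phant"

# [a-zA-Z] is a literal range: accented letters fall outside it,
# so the word is broken up wherever one appears.
ascii_letters = re.findall(r"[a-zA-Z]+", word)

# \w matches any Unicode word character (letters, digits, underscore).
word_chars = re.findall(r"\w+", word)

print(ascii_letters)  # ['l', 'phant']
print(word_chars)     # the whole word as a single match
```

The range splits the word at every accented letter, while \w keeps it in one piece.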


> I mean, how do you deal with
> languages that are not English with regular expressions? I would
> assume that as long as you set the right encoding, Python will be able
> to determine which subset of specific sequences of bytes count as a-z
> or A-Z.

Encodings have nothing to do with this issue.

Literal characters such as a, b, ..., z always have ONE meaning: they represent themselves (although possibly in a case-insensitive fashion). E means E, not È, É, Ê or Ë.

Localization tells the regex how to interpret special patterns like \d and \w. This has nothing to do with encodings -- by the time the regex sees the string, it is already dealing with characters. Localization is about which characters belong to which categories ("is 5 a digit or a letter? how about ٣ ?").
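
To make the category question concrete, here is a small sketch, again assuming Python 3 and str patterns. It checks the ٣ from the paragraph above (U+0663, ARABIC-INDIC DIGIT THREE) against \d, against the literal range, and against \d with re.ASCII, which restricts the categories to ASCII:

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# \d asks "is this character in the digit category?"
unicode_match = re.fullmatch(r"\d", arabic_three) is not None

# The literal range 0-9 matches only those ten characters.
range_match = re.fullmatch(r"[0-9]", arabic_three) is not None

# re.ASCII narrows \d (and \w) down to the ASCII-only categories.
ascii_match = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

print(unicode_match, range_match, ascii_match)  # True False False
```

No encoding is mentioned anywhere in this example: the regex only ever sees characters.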

Encoding is used to translate between bytes on disk and characters. For example, the character Ë could be stored on disk as the hex bytes:

\xcb              # one byte (Latin-1)
\xc3\x8b          # two bytes (UTF-8)
\xff\xfe\xcb\x00  # four bytes (UTF-16, little-endian, with byte order mark)

and more, depending on the encoding used.
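
The byte sequences above can be reproduced with str.encode. This is only a sketch, assuming Python 3. The byte order mark is added by hand so that the result does not depend on the machine's native byte order:

```python
import codecs

char = "\u00cb"  # Ë, LATIN CAPITAL LETTER E WITH DIAERESIS

latin1 = char.encode("latin-1")
utf8 = char.encode("utf-8")
# Byte order mark followed by the little-endian UTF-16 code unit.
utf16 = codecs.BOM_UTF16_LE + char.encode("utf-16-le")

for name, data in [("latin-1", latin1), ("utf-8", utf8), ("utf-16", utf16)]:
    # Decoding with the matching codec gives back the same character.
    print(name, data, data.decode(name) == char)
```

Three different byte strings, one character: that is all an encoding does.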


--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
