Josep M. Fontana wrote:
[...]
> I guess this is because the character encoding was not specified but
> accented characters in the languages I'm dealing with should be
> treated as a-z or A-Z, shouldn't they?
No. a-z means a-z. If you want the localized set of alphanumeric
characters, you need \w (which also matches the underscore).
Likewise 0-9 means 0-9. If you want localized digits, you need \d.
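
To make the difference concrete, here is a minimal sketch. It assumes
Python 3, where str patterns are Unicode-aware by default (older versions
need a flag such as re.UNICODE or re.LOCALE to get this behaviour). The
sample text is made up for illustration, and the non-ASCII characters are
written as escapes so the example does not depend on how your editor
saves the file.

```python
import re

# "café naïve 123 ٣" with the non-ASCII characters written as escapes
text = "caf\u00e9 na\u00efve 123 \u0663"

# [a-zA-Z] covers only the 52 ASCII letters, so accented letters split words
ascii_letters = re.findall(r"[a-zA-Z]+", text)

# \w matches every character the regex engine classifies as alphanumeric
word_chars = re.findall(r"\w+", text)

# The same distinction holds for digits
ascii_digits = re.findall(r"[0-9]+", text)
any_digits = re.findall(r"\d+", text)

print(ascii_letters)  # ['caf', 'na', 've']
print(word_chars)     # ['café', 'naïve', '123', '٣']
print(ascii_digits)   # ['123']
print(any_digits)     # ['123', '٣']
```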
> I mean, how do you deal with
> languages that are not English with regular expressions? I would
> assume that as long as you set the right encoding, Python will be able
> to determine which subset of specific sequences of bytes count as a-z
> or A-Z.
Encodings have nothing to do with this issue.
Literal characters a, b, ..., z etc. always have ONE meaning: they
represent themselves (although possibly in a case-insensitive fashion).
E means E, not È, É, Ê or Ë.
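
A quick way to check this for yourself, again assuming Python 3. The
variable names are only for illustration.

```python
import re

accented = "\u00c8\u00c9\u00ca\u00cb"  # È É Ê Ë

# A literal E in the pattern matches only E itself
plain = re.search("E", accented)

# Ignoring case widens E to e, but never to the accented forms
folded = re.search("E", "e", re.IGNORECASE)
still_none = re.search("E", accented, re.IGNORECASE)

print(plain, folded is not None, still_none)  # None True None
```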
Localization tells the regex how to interpret special patterns like \d
and \w. This has nothing to do with encodings -- by the time the regex
sees the string, it is already dealing with characters. Localization is
about which characters belong to which categories ("is 5 a digit or a
letter? how about ٣ ?").
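
As a rough illustration of "categories", the sketch below asks Python's
unicodedata module how it classifies a few characters, then shows \d
following that classification. It assumes Python 3, where the Unicode
database (rather than the C locale) drives \d and \w for str patterns,
and where the re.ASCII flag restricts them to ASCII.

```python
import re
import unicodedata

# "Nd" is a decimal digit, "Lu" an uppercase letter
for ch in ("5", "\u0663", "E", "\u00cb"):  # 5, ٣, E, Ë
    print(repr(ch), unicodedata.category(ch))

# \d accepts ٣ because it is classified as a decimal digit...
unicode_digit = re.fullmatch(r"\d", "\u0663")

# ...unless the pattern is restricted to ASCII, where \d means [0-9]
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII)

print(unicode_digit is not None, ascii_digit is not None)  # True False
```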
Encoding is used to translate between bytes on disk and characters. For
example, the character Ë could be stored on disk as the hex bytes:
\xcb # one byte (Latin-1)
\xc3\x8b # two bytes (UTF-8)
\xff\xfe\xcb\x00 # four bytes (UTF-16 with a byte-order mark)
and more, depending on the encoding used.
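
You can reproduce those byte sequences with a short sketch like the one
below (Python 3 assumed). The UTF-16 case spells out the little-endian
byte-order mark explicitly, since the plain "utf-16" codec picks the byte
order of the machine it runs on.

```python
import codecs

ch = "\u00cb"  # Ë

latin1 = ch.encode("latin-1")
utf8 = ch.encode("utf-8")
# Little-endian UTF-16, prefixed with its byte-order mark
utf16 = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

print(latin1)  # b'\xcb'
print(utf8)    # b'\xc3\x8b'
print(utf16)   # b'\xff\xfe\xcb\x00'

# Decoding with the matching codec gives back the same single character
decoded = {latin1.decode("latin-1"), utf8.decode("utf-8"), utf16.decode("utf-16")}
print(decoded == {ch})  # True
```

Whichever encoding was used on disk, the regex only ever sees the one
decoded character.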
--
Steven