Albert-Jan Roskam wrote:
> So the raw string \b means "ASCII backspace". Is that another way of
> saying that it means 'Word boundary'?
No.
Python string literals use backslash escapes for special characters,
similar to what many other computer languages, including C, do.
So when you type "hello world\n" as a *literal* in source code, the \n
doesn't mean backslash-n, but it means a newline character. The special
escapes used by Python include:
\0      NUL (ASCII code 0)
\a      BELL character (ASCII code 7)
\b      BACKSPACE (ASCII code 8)
\n      newline
\t      tab
\r      carriage return
\'      single quote (does not close string)
\"      double quote (does not close string)
\\      backslash
\nnn    character with ASCII code nnn in octal (one to three octal digits)
\xXX    character with ASCII code XX in hex (exactly two hex digits)
\b (backspace) doesn't have anything to do with word boundaries.
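If it helps, you can check this at the interactive prompt. This is only a small sketch (the variable name is mine), showing that the escape in an ordinary literal is a single character, while the escaped and raw forms are two:

```python
# In an ordinary string literal, \b is ONE character: ASCII backspace.
backspace = "\b"
print(len(backspace))   # 1
print(ord(backspace))   # 8

# Escaping the backslash, or using a raw string, gives TWO characters:
# an actual backslash followed by the letter b.
print(len("\\b"))       # 2
print(len(r"\b"))       # 2
print("\\b" == r"\b")   # True

# Hex and octal escapes both name a character by its code.
print("\x41" == "A" == "\101")   # True
```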
Regexes, however, are a computer language in themselves, and they use an
*actual backslash* to introduce special meaning. Because that backslash
clashes with the use of backslashes in Python string literals, you have
to work around the clash. You could do any of these:
# Escape the backslash, so Python won't treat it as special:
pattern = '\\bword\\b'
# Use chr() to build up a non-literal string:
pattern = chr(92) + 'bword' + chr(92) + 'b'
# Use raw strings:
pattern = r'\bword\b'
The Python compiler treats a backslash as just an ordinary character
when it compiles a raw string, so that's the simplest and best solution.
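To convince yourself that the three spellings really are the same pattern, and to see what goes wrong if you forget the workaround, something like this should do. The sample text and variable names are just made up for illustration:

```python
import re

# Three ways of writing the same regex: backslash-b, "word", backslash-b.
escaped = '\\bword\\b'
built = chr(92) + 'bword' + chr(92) + 'b'
raw = r'\bword\b'
same = escaped == built == raw
print(same)             # True

text = 'a word, not swords'

# With a real backslash, the regex engine sees \b as "word boundary",
# so only the standalone "word" matches, not the one inside "swords".
raw_matches = re.findall(raw, text)
print(raw_matches)      # ['word']

# Without the workaround, Python turns \b into a backspace character
# before the regex engine ever sees it, so nothing in the text matches.
plain_matches = re.findall('\bword\b', text)
print(plain_matches)    # []
```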
> You're right: debugging regexes is a PIA. One teeny weeny mistake makes all the
> difference. Could one say that, in general, it's better to use a Divide and
> Conquer strategy and use a series of regexes and other string operations to
> reach one's goal?
Absolutely!
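As a rough illustration of what I mean, here is one way it might look. The function parse_record and the 'key = value; key = value' input format are invented for this example, not anything from your actual data. Plain string methods do the coarse splitting, and one small, easy-to-test regex handles each piece:

```python
import re

def parse_record(line):
    """Hypothetical helper: parse 'key = value' pairs separated by ';'."""
    result = {}
    # Step 1: split into fields with an ordinary string method.
    for field in line.split(';'):
        # Step 2: trim surrounding whitespace and skip empty fields.
        field = field.strip()
        if not field:
            continue
        # Step 3: one small regex per field instead of one giant pattern.
        match = re.match(r'(\w+)\s*=\s*(.*)$', field)
        if match:
            result[match.group(1)] = match.group(2)
    return result

record = parse_record('name = Albert-Jan; lang=Python ; list = tutor')
print(record)
```

Each step can be tested on its own, so when something goes wrong you know which piece to look at.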
--
Steven
_______________________________________________
Tutor maillist - Tutor@python.org