Re: [Tutor] regex questions

Steven D'Aprano Fri, 18 Feb 2011 02:15:15 -0800

Albert-Jan Roskam wrote:

> So the raw string \b means "ASCII backspace". Is that another way of saying that it means 'Word boundary'?

No.

Python string literals use backslash escapes for special characters, as many other computer languages do, including C.

So when you type "hello world\n" as a *literal* in source code, the \n doesn't mean backslash-n; it means a newline character. The special escapes used by Python include:


\0  NULL (ASCII code 0)
\a  BELL character (ASCII code 7)
\b  BACKSPACE (ASCII code 8)
\n  newline
\t  tab
\r  carriage return
\'  single quote  (does not close string)
\"  double quote  (does not close string)
\\  backslash
\nnn  character with code nnn in octal (one to three octal digits)
\xXX  character with ASCII code XX in hex

\b (backspace) doesn't have anything to do with word boundaries.
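
A quick check makes this visible. This is only a small sketch; the sample strings are made up for illustration:

```python
# In an ordinary string literal, each escape becomes ONE character.
greeting = "hello world\n"
print(len(greeting))        # 12: eleven visible characters plus the newline

# \b is a single character with ASCII code 8 (backspace).
print(len('\b'), ord('\b'))

# Escaping the backslash gives two characters: a real backslash and a 'b'.
print(len('\\b'), [ord(c) for c in '\\b'])

# Octal and hex escapes name a character by its code.
print('\101', '\x41')       # both are the letter A
```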

Regexes, however, are a computer language in themselves, and they use an *actual backslash* to introduce special meaning. Because that backslash clashes with the use of backslashes in Python string literals, you have to work around the clash. You could do any of these:


# Escape the backslash, so Python won't treat it as special:
pattern = '\\bword\\b'

# Use chr() to build up a non-literal string:
pattern = chr(92) + 'bword' + chr(92) + 'b'

# Use raw strings:
pattern = r'\bword\b'

The Python compiler treats a backslash as just an ordinary character when it compiles raw strings. So that's the simplest and best solution.

> You're right: debugging regexes is a PIA. One teeny weeny mistake makes all the difference. Could one say that, in general, it's better to use a Divide and Conquer strategy and use a series of regexes and other string operations to reach one's goal?


Absolutely!
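
As a rough illustration of that approach (the input line and the variable names are made up, so treat this as a sketch rather than a recipe), let an ordinary string method do the coarse splitting and keep the regex small:

```python
import re

line = 'error 42: disk full'   # hypothetical input

# Step 1: a plain string method splits the line at the first colon.
head, _, message = line.partition(':')

# Step 2: a small regex handles only the part that needs one.
match = re.search(r'\b\d+\b', head)
code = int(match.group()) if match else None

print(code, message.strip())
```

Each step is simple enough to test on its own, which is much easier than debugging one large regex that tries to do everything at once.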



--
Steven
