Re: RegExp performance?

Marc 'BlackJack' Rintsch Sun, 25 Feb 2007 01:36:05 -0800

In <[EMAIL PROTECTED]>, Christian Sonne wrote:

> Long story short, I'm trying to find all ISBN-10 numbers in a multiline 
> string (approximately 10 pages of a normal book), and as far as I can 
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"
> 
> (it should be noted that I've removed all '-'s in the string, because 
> they have a tendency to be mixed into ISBN's)
> 
> however, on my 3200+ amd64, running the following:
> 
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
> isbn10s = reISBN10.findall(contents)
> 
> (where contents is the string)
> 
> this takes about 14 minutes - and there are only one or two matches...


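First of all, try to get rid of the '.*' at both ends of the regexp.  Don't
let the re engine search for any characters that you are not interested in
anyway.  `findall()` already scans the whole string for matches, so the
'.*' only gives the engine extra work to backtrack through.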

Then leave off the '*' after '\D'.  It doesn't matter if there are
multiple non-digits before or after the ISBN; there just has to be at
least one.  BTW, with the star it even matches when there is *no*
non-digit at all!
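
To illustrate the point about the star, here is a small sketch.  The sample
string is made up for the demonstration; it is a 14-digit number that
should not be reported as an ISBN-10:

```python
import re

# Made-up sample: a 14-digit number, not an ISBN-10.
sample = "12345678901234"

# With '\D*' the non-digit parts may be empty, so the pattern happily
# picks ten digits out of the middle of a longer number.
with_star = re.compile(r'\D*(\d{10}|\d{9}X)\D*')

# With a plain '\D' a real non-digit is required on both sides.
without_star = re.compile(r'\D(\d{10}|\d{9}X)\D')

star_hits = with_star.findall(sample)
plain_hits = without_star.findall(sample)
print(star_hits)
print(plain_hits)
```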

So the re looks like this: '\D(\d{10}|\d{9}X)\D'
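
A minimal sketch of how that could look in use.  The text in `contents` is
invented for the example.  Note that this pattern needs a non-digit on both
sides, so an ISBN at the very start or end of the string is not found.  The
second pattern, with lookarounds, is my own variation and not part of the
suggestion above; it only checks that no digit is adjacent:

```python
import re

# The simplified pattern: one non-digit, the ISBN, one non-digit.
reISBN10 = re.compile(r'\D(\d{10}|\d{9}X)\D')

# Invented sample text standing in for the real `contents`.
contents = ("First book: 0306406152, second book: 123456789X.\n"
            "No ISBN on this line: 12345.\n")
isbn10s = reISBN10.findall(contents)
print(isbn10s)

# Assumption, not from the thread: lookarounds check the neighbours
# without consuming them, so ISBNs at the string boundaries also match.
reISBN10_lookaround = re.compile(r'(?<!\d)(\d{10}|\d{9}X)(?!\d)')

edge_case = "0306406152 123456789X"
edge_plain = reISBN10.findall(edge_case)
edge_lookaround = reISBN10_lookaround.findall(edge_case)
print(edge_plain)
print(edge_lookaround)
```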

Ciao,
        Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list
