In <[EMAIL PROTECTED]>, Christian Sonne wrote:
> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"
>
> (it should be noted that I've removed all '-'s in the string, because
> they have a tendency to be mixed into ISBN's)
>
> however, on my 3200+ amd64, running the following:
>
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
> isbn10s = reISBN10.findall(contents)
>
> (where contents is the string)
>
> this takes about 14 minutes - and there are only one or two matches...
First of all try to get rid of the '.*' at both ends of the regexp. Don't
let the re engine search for any characters that you are not interested in
anyway.
Then leave off the '*' after '\D'. It doesn't matter if there are
multiple non-digits before or after the ISBN, there just have to be at
least one. BTW with the star it even matches *no* non-digit too!
So the re looks like this: '\D(\d{10}|\d{9}X)\D'
Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list