Liam Clarke wrote:
> Hi all,
>
> Using Beautiful Soup and regexes.. I've noticed that all the examples
> used regexes like so - anchors = parseTree.fetch("a",
> {"href":re.compile("pattern")} ) instead of precompiling the pattern.
>
> Myself, I have the following code -
>
> >>> z = []
> >>> x = q.findNext("a", {"href":re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)})
> >>> while x:
> ...     num = x.findNext("td", "tableColA")
> ...     h = (x.contents[0],x.attrMap["href"],num.contents[0])
> ...     z.append(h)
> ...     x = x.findNext("a",{"href":re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)})
> ...
>
> This gives me a correct set of results. However, using the following -
>
> >>> z = []
> >>> pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
> >>> x = q.findNext("a", {"href":pattern)})
> >>> while x:
> ...     num = x.findNext("td", "tableColA")
> ...     h = (x.contents[0],x.attrMap["href"],num.contents[0])
> ...     z.append(h)
> ...     x = x.findNext("a",{"href":pattern} )
>
> will only return the first found tag.
>
> Is the regex only evaluated once or similar?

I don't know why there should be any difference unless BS modifies the
compiled regex object and for some reason needs a fresh one each time.
That would be odd and I don't see it in the source code.

The code above has a syntax error (extra paren in the first findNext()
call) - can you post the exact non-working code?

> (Also any pointers on how to get negative lookahead matching working
> would be great.
> the regex (/thread/[0-9]*)(?!\/) still matches "/thread/28606/" and
> I'd assumed it wouldn't.

Putting these expressions into Regex Demo is enlightening - the regex
matches against "/thread/2860" - in other words the "not /" is matching
against the 6.

You don't give an example of what you do want to match so it's hard to
know what a better solution is. Some possibilities
- match anything except a digit or a slash - [^0-9/]
- match the end of the string - $
- both of the above - ([^0-9/]|$)

Kent

> Regards,
>
> Liam Clarke
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
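
The backtracking described above can be reproduced with nothing but the
re module. The sketch below is illustrative only: the names ORIGINAL,
CONSUMING, LOOKAHEAD and thread_prefix are made up for this example, and
the LOOKAHEAD pattern is an assumption added for comparison (a
non-consuming spelling of the "both of the above" suggestion), not
something proposed in the thread.

```python
import re

# Liam's original pattern: digits, then "not followed by a slash".
# [0-9]* is free to give back one digit, so the lookahead can succeed early.
ORIGINAL = re.compile(r"(/thread/[0-9]*)(?!/)")

# The "both of the above" suggestion: after the digits, require a character
# that is neither a digit nor a slash, or the end of the string.
CONSUMING = re.compile(r"(/thread/[0-9]*)([^0-9/]|$)")

# Assumed alternative (not from the thread): the same condition written as
# a negative lookahead, so the following character is not consumed.
LOOKAHEAD = re.compile(r"(/thread/[0-9]*)(?![0-9/])")


def thread_prefix(pattern, url):
    """Return group 1 of the first match of pattern in url, or None."""
    match = pattern.search(url)
    return match.group(1) if match else None


for url in ("/thread/28606/", "/thread/28606", "/thread/28606?page=2"):
    print(url,
          thread_prefix(ORIGINAL, url),
          thread_prefix(CONSUMING, url),
          thread_prefix(LOOKAHEAD, url))
```

With "/thread/28606/" the original pattern captures "/thread/2860",
while the other two find no match at all, because no amount of
backtracking over the digits leaves a character that is neither a digit
nor a slash. Whether a bare "/thread/28606" should count as a match
depends on what you actually want to find.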