Re: [Tutor] RE help

Kent Johnson Tue, 15 Feb 2005 12:32:05 -0800

Try it with non-greedy matches. As written, the greedy .* produces a single match that runs from the first <hX><a to the last </p>. Also, I think you want to escape the . before </p> (assuming you only want paragraphs that end in a period?)

pattern = re.compile(r"""<h[1-2]><a href="/(.*?)">(.*?)\.</p>""", re.DOTALL)
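
To see the difference concretely, here is a minimal sketch that runs both patterns against a small made-up page. The page string, the example.com URL, and the names greedy and lazy are assumptions for illustration only; they are not taken from the real startribune.com markup.

```python
import re

# Hypothetical page: two articles followed by an unrelated link that
# also ends in "</p>", similar in shape to the Boxerjam snippet quoted below.
page = (
    '<h2><a href="/stories/1.html">First headline</a></h2>'
    '<p>First body.</p>\n'
    '<h2><a href="/stories/2.html">Second headline</a></h2>'
    '<p>Second body.</p>\n'
    '<p><a href="http://example.com/game.html"><b>Boxerjam</b></a>.\n</p>'
)

# Original pattern: greedy groups, unescaped "." before </p>.
greedy = re.compile(r'<h[1-2]><a href="/(.*)">(.*).</p>', re.DOTALL)

# Suggested pattern: non-greedy groups, literal "." before </p>.
lazy = re.compile(r'<h[1-2]><a href="/(.*?)">(.*?)\.</p>', re.DOTALL)

greedy_matches = greedy.findall(page)
lazy_matches = lazy.findall(page)

# The greedy version yields one match spanning the whole page; its second
# group is only the text after the last '">'.
for headline, body in greedy_matches:
    print(body)

# The non-greedy version yields one (link, text) pair per article.
for headline, body in lazy_matches:
    print(body)
```

With this sample, the greedy pattern returns a single tuple whose body is the trailing Boxerjam link, while the non-greedy pattern returns one tuple per article.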

Kent

Ron Nixon wrote:

Trying to scrape a newspaper site for articles using
this code, which was done with help from the list:

import urllib, re

pattern = re.compile("""<h[1-2]><a href="/(.*)">(.*).</p>""", re.DOTALL)
page = urllib.urlopen("http://www.startribune.com").read()

for headline, body in pattern.findall(page):
    print body

It should grab articles from this:

<h2><a href="/stories/507/5240764.html">Sid Hartman:
Franchise could be moved</a></h2><p>If Reggie Fowler
and his business partners from New Jersey are approved
to buy the Vikings franchise from Red McCombs, it is
my opinion the franchise remains in danger of
eventually being relocated.</p>

and give me this: Sid Hartman: Franchise could be
moved</a></h2><p>If Reggie Fowler and his business
partners from New Jersey are approved to buy the
Vikings franchise from Red McCombs, it is my opinion
the franchise remains in danger of eventually being
relocated.

Instead it gives me this: <b>Boxerjam</b></a>. from
this:
href="http://www.startribune.com/stories/1559/4773140.html"><b>Boxerjam</b></a>.
</p></div>

I know the re works in other programs I've tried. Is
there something different about re's in Python?



_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
