Re: web page text extractor

Andre Engels Thu, 12 Jul 2007 07:24:26 -0700

2007/7/12, kublai <[EMAIL PROTECTED]>:

> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the URL of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars should be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone point me in the right
> direction?


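import re
import urllib2  # Python 2 standard library


def textonly(url):
    # Fetch the HTML source at url and return it with the tags stripped out.
    f = urllib2.urlopen(url)
    text = f.read()
    f.close()
    # Matches anything that looks like a tag: '<', no angle brackets, then '>'.
    r = re.compile(r'<[^<>]*>')
    newtext = r.sub('', text)
    # Repeat until nothing changes, so nested brackets such as '<<b>>' also go.
    while newtext != text:
        text = newtext
        newtext = r.sub('', text)
    return text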
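
Stripping tags with a regex leaves the contents of <script> and <style> blocks in the output and does not decode entities such as &amp;. What follows is a minimal sketch for Python 3 (where urllib2 became urllib.request), built on the standard library's html.parser. The names TextOnlyParser, html_to_text, fetch_text and open_in_editor are made up for this example, the set of block-level tags is a rough assumption, the UTF-8 fallback is a guess, and nothing here tries to drop sidebars or navigation.

```python
import re
import subprocess
import tempfile
import urllib.request
from html.parser import HTMLParser

# Hypothetical tag sets chosen for illustration; extend as needed.
SKIP = {"script", "style"}
BLOCK = {"p", "br", "div", "li", "tr", "title",
         "h1", "h2", "h3", "h4", "h5", "h6"}


class TextOnlyParser(HTMLParser):
    """Collects text, skipping <script>/<style> and breaking lines at blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Newlines inside running text are plain whitespace in HTML.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return the text of an HTML string, one block per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def fetch_text(url):
    """Download url and return its text; falls back to UTF-8 (an assumption)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)


def open_in_editor(text, editor="notepad"):
    """Write text to a temporary .txt file and open it in the given editor."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as f:
        f.write(text)
    subprocess.Popen([editor, f.name])
    return f.name
```

With these in place, open_in_editor(fetch_text(url)) would show the extracted text in Notepad on Windows, where it can be checked and saved by hand; on other systems, pass a different editor name.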



-- 
Andre Engels, [EMAIL PROTECTED]
ICQ: 6260644  --  Skype: a_engels
