[Tutor] retrieve URLs and text from web pages

2010-06-27 Thread Khawla Al-Wehaibi
Hi,

I’m new to programming. I’m currently learning Python to write a web crawler that 
extracts all the text from a web page and also crawls on to further URLs, 
collecting the text there. The idea is to place all the extracted text in a 
.txt file with each word on its own line, so the text has to be tokenized. All 
punctuation marks, duplicate words and non-stop words have to be removed.

The program should crawl the web to a certain depth and collect the URLs and 
text from each depth (level). I decided to choose a depth of 3. I divided the 
code into two parts: part one to collect the URLs and part two to extract the 
text. Here are my problems:

1.    The program is extremely slow. 
2.    I'm not sure if it functions properly.
3.    Is there a better way to extract text?
4.    Are there any available modules to help clean the text i.e. removing 
duplicates, non-stop words ...
5.    Any suggestions or feedback is appreciated.

(Please note: the majority of the code (the first part) was written by “James 
Mills”. I found the code online and it looked helpful, so I used it. I just 
modified it and added my own code to it.)

Thanks,
Kal

import sys
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup,NavigableString

__version__ = "0.1"
__copyright__ = "CopyRight (C) 2008 by James Mills"
__license__ = "GPL"
__author__ = "James Mills"
__author_email__ = "James Mills, James dot Mills st dotred dot com dot au"

USAGE = "%prog [options] "
VERSION = "%prog v" + __version__

AGENT = "%s/%s" % (__name__, __version__)

def encodeHTML(s=""):
    """encodeHTML(s) -> str

    Encode HTML special characters from their ASCII form to
    HTML entities.
    """

    return s.replace("&", "&") \
    .replace("<", "<") \
    .replace(">", ">") \
    .replace("\"", """) \
    .replace("'", "'") \
    .replace("--", "&mdash")

class Fetcher(object):

    def __init__(self, url):
        self.url = url
        self.urls = []

    def __contains__(self, x):
        return x in self.urls

    def __getitem__(self, x):
        return self.urls[x]

    def _addHeaders(self, request):
        request.add_header("User-Agent", AGENT)

    def open(self):
        url = self.url
        #print "\nFollowing %s" % url
        try:
            request = urllib2.Request(url)
            handle = urllib2.build_opener()
        except IOError:
            return None
        return (request, handle)

    def fetch(self):
        # open() returns None when the request cannot be built
        opened = self.open()
        if opened is None:
            return
        request, handle = opened
        self._addHeaders(request)
        soup = BeautifulSoup()
        tags = []
        try:
            content = unicode(handle.open(request).read(), errors="ignore")
            soup.feed(content)
            #soup = BeautifulSoup(content)
            tags = soup('a')
        except urllib2.HTTPError, error:
            if error.code == 404:
                print >> sys.stderr, "ERROR: %s -> %s" % (error, error.url)
            else:
                print >> sys.stderr, "ERROR: %s" % error
        except urllib2.URLError, error:
            print >> sys.stderr, "ERROR: %s" % error
        for tag in tags:
            try:
                href = tag["href"]
                if href is not None:
                    url = urlparse.urljoin(self.url, encodeHTML(href))
                    if url not in self:
                        #print " Found: %s" % url
                        self.urls.append(url)
            except KeyError:
                pass


# I created 3 lists (root, level2 and level3).
# Each list saves the URLs of that level, i.e. depth. I chose to create 3
# lists so I can have the flexibility of testing the text at each level. Also,
# the 3 lists can easily be combined into one list.


# Level1:
# Fetcher.fetch() already skips duplicate links, so the root level
# needs no extra de-duplication.
root = Fetcher('http://www.wantasimplewebsite.co.uk/index.html')
root.fetch()

print "\nRoot URLs are:"
for i, url in enumerate(root):
    print "%d. %s" % (i+1, url)

# Level2:
level2 = []
for url in root:  # Traverse every element (i.e. URL) in root and fetch the URLs from it
    temp = Fetcher(url)
    temp.fetch()
    for url in temp:
        if url not in level2:  # Avoid duplicate links
            level2.append(url)

print "\nLevel2 URLs are:"
for i, url in enumerate(level2):
    print "%d. %s" % (i+1, url)
 
# Level3:
level3 = []
for url in level2:  # Traverse every element (i.e. URL) in level2 and fetch the URLs from it
    temp = Fetcher(url)
    temp.fetch()
    for url in temp:
        if url not in level3:  # Avoid duplicate links
            level3.append(url)

print "\nLevel3 URLs are:"
for i, url in enumerate(level3):
    print "%d. %s" % (i+1, url)
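
A minimal sketch of what the second part (extracting the text) could look like, assuming the same Python 2 / BeautifulSoup 3 setup as the code above; page_text is just an illustrative name:

def page_text(url):
    """Return the visible text of a page as one unicode string."""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    for tag in soup(["script", "style"]):  # skip non-visible content
        tag.extract()
    return " ".join(soup.findAll(text=True))

#print page_text("http://www.wantasimplewebsite.co.uk/index.html")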

Re: [Tutor] retrieve URLs and text from web pages

2010-06-29 Thread Khawla Al-Wehaibi
Thanks Tino. Sorry for the way the post looks. It is terrible to read.

I decided to go with regular expressions to modify the text. The Python.org 
documentation states that they provide more options and flexibility than plain 
strings and their methods.
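
For what it's worth, the kind of clean-up described in the first post (tokenize, drop punctuation, duplicates and stop words) can be sketched in a few lines of regex code; the stop-word list below is only a placeholder and tokenize is a made-up name:

import re

STOP_WORDS = set(["the", "a", "an", "and", "of", "in", "to"])  # placeholder list

def tokenize(text):
    """Lower-case words with punctuation, stop words and duplicates removed."""
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()
    result = []
    for w in words:
        if w not in STOP_WORDS and w not in seen:
            seen.add(w)
            result.append(w)
    return result

# One word per line, as described in the original post:
#open("words.txt", "w").write("\n".join(tokenize(some_text)))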

Thanks



--- On Tue, 29/6/10, Tino Dai  wrote:

From: Tino Dai 
Subject: Re: [Tutor] retrieve URLs and text from web pages
To: "Khawla Al-Wehaibi" 
Cc: tutor@python.org
Date: Tuesday, 29 June, 2010, 5:34


On Sun, Jun 27, 2010 at 12:15 PM, Khawla Al-Wehaibi  wrote:

Hi,

I’m new to programming. I’m currently learning python to write a web crawler to 
extract all text from a web page, in addition to, crawling to further URLs and 
collecting the text there. The idea is to place all the extracted text in a 
.txt file with each word in a single line. So the text has to be tokenized. All 
punctuation marks, duplicate words and non-stop words have to be removed.

Welcome to Python! What you are doing is best done as a multi-step process so 
that you can understand everything that you are doing. To really
leverage Python, there are a couple of things that you need to read right off 
the bat.


http://docs.python.org/library/stdtypes.html   (stuff about strings). In 
Python, everything is an object, so everything has methods or functions 
related to it. For instance, the string object has a find method that will 
return the position of a substring. Pretty handy if you ask me.
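
For instance (a quick interpreter session, just to illustrate):

>>> s = "Hello, World!"
>>> s.find("World")      # position of a substring, or -1 if absent
7
>>> s.lower().split()    # crude tokenization with plain string methods
['hello,', 'world!']
>>> s.strip("!,. ")      # trim punctuation from the ends
'Hello, World'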


Also, I would read up on sets in Python. That will reduce the size of your 
code significantly.
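
For example, a set of already-seen URLs would replace the repeated "if url not in level2" checks, and the three per-level blocks could collapse into one loop. A rough sketch, reusing the Fetcher class from the post (crawl_to_depth is just a made-up name):

def crawl_to_depth(start_url, depth=3):
    """Breadth-first crawl; returns a list of sets of URLs, one set per level."""
    seen = set([start_url])
    levels = [set([start_url])]
    for _ in range(depth - 1):
        next_level = set()
        for url in levels[-1]:
            f = Fetcher(url)
            f.fetch()
            next_level.update(u for u in f.urls if u not in seen)
        seen.update(next_level)
        levels.append(next_level)
    return levels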


The program should crawl the web to a certain depth and collect the URLs and 
text from each depth (level). I decided to choose a depth of 3. I divided the 
code to two parts. Part one to collect the URLs and part two to extract the 
text. Here is my problem:


1.    The program is extremely slow. 

The best way to go about this is to use a profiler:

 http://docs.python.org/library/profile.html
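
For example, run the whole script under the profiler from the shell (assuming you saved it as crawler.py):

python -m cProfile crawler.py

or profile just the call you suspect is slow from inside the script:

import cProfile
cProfile.run("root.fetch()", sort="cumulative")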



2.    I'm not sure if it functions properly.

To debug your code, you may want to read up on the python debugger.
 http://docs.python.org/library/pdb.html
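
For example, dropping into the debugger just before the part that misbehaves:

import pdb

root = Fetcher('http://www.wantasimplewebsite.co.uk/index.html')
pdb.set_trace()   # execution stops here; 'n' steps, 'p root.urls' inspects
root.fetch()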



3.    Is there a better way to extract text?

See the strings and the lists. I think that you will be pleasantly surprised.

4.    Are there any available modules to help clean the text i.e. removing 
duplicates, non-stop words ...


Read up on sets and the string functions/methods. They are your friends.

5.    Any suggestions or feedback is appreciated.


-Tino

PS: Please don't send HTML-laden emails; it makes them harder to work with. 
Thanks




