Thanks Tino. Sorry for the way the post looks. It is terrible to read.

I decided to go with regular expressions to modify the text. The documentation 
on Python.org states that they provide more options and flexibility than 
string methods and the string module.
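
For anyone reading along, here is a minimal sketch of the kind of regular 
expression tokenizing I mean. The function name tokenize and the sample 
sentence are made up for illustration, and the pattern is only one possible 
choice (lowercase ASCII letters with an optional internal apostrophe).

```python
import re

def tokenize(text):
    """Return the lowercase words found in text, ignoring punctuation."""
    # Letters, optionally followed by an apostrophe and more letters,
    # so that "don't" stays one token. This pattern is an assumption.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

sample = "Hello, world! Don't panic: it's only text."
tokens = tokenize(sample)
print("\n".join(tokens))  # one word per line
```

Digits and non-English letters are dropped by this pattern, so it would need 
adjusting for pages that contain them.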

Thanks



--- On Tue, 29/6/10, Tino Dai <obe...@gmail.com> wrote:

From: Tino Dai <obe...@gmail.com>
Subject: Re: [Tutor] retrieve URLs and text from web pages
To: "Khawla Al-Wehaibi" <kweha...@yahoo.com>
Cc: tutor@python.org
Date: Tuesday, 29 June, 2010, 5:34


On Sun, Jun 27, 2010 at 12:15 PM, Khawla Al-Wehaibi <kweha...@yahoo.com> wrote:

Hi,

I’m new to programming. I’m currently learning Python in order to write a web 
crawler that extracts all the text from a web page and also crawls to further 
URLs, collecting the text there. The idea is to place all the extracted text in 
a .txt file with each word on a single line. So the text has to be tokenized. 
All punctuation marks, duplicate words and non-stop words have to be removed.

Welcome to Python! What you are doing is best done as a multi-step process so 
that you can understand everything that you are doing. To really
leverage Python, there are a couple of things that you need to read right off 
the bat.


http://docs.python.org/library/stdtypes.html   (Stuff about strings). In 
Python, everything is an object, so everything has methods or functions 
related to it. For instance, a string object has a find method that returns 
the position of a substring within it (or -1 if the substring is not there). 
Pretty handy if you ask me.
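
A quick sketch of find in action. The sample text is invented for 
illustration; only str.find itself comes from the docs linked above.

```python
text = "Welcome to Python! Python is fun."

first = text.find("Python")              # index of the first match
second = text.find("Python", first + 1)  # keep searching after the first match
missing = text.find("Perl")              # find returns -1 when nothing matches

print(first, second, missing)
```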


Also, I would read up on sets in Python. That will reduce the size of your 
code significantly. 
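
As a rough illustration of why sets help here: the word list and the 
stop_words set below are made up, and you would substitute your own.

```python
words = ["the", "cat", "and", "the", "hat", "and", "cat"]

# A set keeps only one copy of each word (order is not preserved).
unique = set(words)

# Set difference removes every unwanted word in one step.
stop_words = {"the", "and"}  # hypothetical list of words to drop
remaining = unique - stop_words

# If the original order matters, track what has been seen in a set.
seen = set()
ordered = []
for word in words:
    if word not in seen:
        seen.add(word)
        ordered.append(word)

print(sorted(remaining))
print(ordered)
```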


The program should crawl the web to a certain depth and collect the URLs and 
text from each depth (level). I decided to choose a depth of 3. I divided the 
code into two parts: part one collects the URLs and part two extracts the 
text. Here are my problems:


1.    The program is extremely slow. 

The best way to go about this is to use a profiler:

 http://docs.python.org/library/profile.html
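
A minimal sketch of profiling a function with cProfile and pstats, assuming 
Python 3. The function count_words is a hypothetical stand-in for whatever 
part of the crawler is slow.

```python
import cProfile
import io
import pstats

def count_words(n):
    """Hypothetical stand-in for a slow part of the crawler."""
    counts = {}
    for i in range(n):
        word = "w%d" % (i % 100)
        counts[word] = counts.get(word, 0) + 1
    return counts

# Record timing data only while the code of interest runs.
profiler = cProfile.Profile()
profiler.enable()
result = count_words(100000)
profiler.disable()

# Write the five entries with the highest cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

A whole script can also be profiled from the command line with 
python -m cProfile -s cumulative yourscript.py (yourscript.py being your own 
file).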



2.    I'm not sure if it functions properly.

To debug your code, you may want to read up on the Python debugger.
 http://docs.python.org/library/pdb.html
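
A small sketch of dropping into pdb at a suspicious spot. The DEBUG flag and 
the extract_links function are hypothetical, and the naive href splitting is 
only there to give the debugger something to stop in.

```python
import pdb

DEBUG = False  # hypothetical switch: set to True to stop in the debugger

def extract_links(html):
    """Naive, illustrative link extraction: not a real HTML parser."""
    links = []
    for chunk in html.split('href="')[1:]:
        if DEBUG:
            # Execution pauses here; at the (Pdb) prompt try
            # "p chunk", "n" (next line) or "c" (continue).
            pdb.set_trace()
        links.append(chunk.split('"')[0])
    return links

page = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'
found = extract_links(page)
print(found)
```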



3.    Is there a better way to extract text?

Look at strings and lists. I think that you will be pleasantly surprised.
 

4.    Are there any available modules to help clean the text, i.e. removing 
duplicates, non-stop words, ...


Read up on sets and the string functions/methods. They are your friends.
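
Here is one way the string methods could do the cleaning, assuming Python 3. 
The function name clean_words and the sample sentence are invented for 
illustration.

```python
import string

def clean_words(text):
    """Lowercase text, strip punctuation, return sorted unique words."""
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return sorted(set(words))

cleaned = clean_words("The cat, the hat; and THE bat!")
output = "\n".join(cleaned)  # one word per line, ready to write to a .txt file
print(output)
```

Note that deleting all punctuation turns "don't" into "dont", so this may be 
too blunt for some text.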

5.    Any suggestions or feedback is appreciated.


-Tino

PS: Please don't send HTML-laden emails; they are harder to work with. 
Thanks 





      
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor