Re: Web Crawling/Threading and Things That Go Bump in the Night

2006-08-04 Thread [EMAIL PROTECTED]
Rem, what OS are you trying this on? Windows XP SP2 has a limit of around 40 TCP connections per second...

Remarkable wrote:
> Hello all
>
> I am trying to write a reliable web-crawler. I tried to write my own
> using recursion and found I quickly hit the "too many sockets" open
> problem. So I lo
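One common way to keep a crawler from opening sockets without bound is to replace recursion with an explicit frontier and a fixed-size worker pool. The sketch below is only an illustration under stated assumptions, not the code from this thread: `crawl`, `fetch_links`, `max_workers` and `max_pages` are hypothetical names, and `fetch_links` is assumed to be a caller-supplied function that downloads one page, closes its connection, and returns the links it found. Whether a given pool size stays under the limit mentioned above depends on the OS.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(start_url, fetch_links, max_workers=8, max_pages=100):
    """Breadth-first crawl with an explicit frontier and a bounded pool.

    fetch_links is a hypothetical, caller-supplied function that takes a
    URL and returns an iterable of URLs found on that page.
    """
    seen = {start_url}      # URLs already queued, to avoid revisiting
    frontier = [start_url]  # URLs to fetch in the next round
    visited = []            # URLs fetched so far, in fetch order

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and len(visited) < max_pages:
            # Never schedule more pages than the remaining budget allows.
            batch = frontier[: max_pages - len(visited)]
            frontier = []
            # The pool runs at most max_workers fetches at the same time.
            for url, links in zip(batch, pool.map(fetch_links, batch)):
                visited.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return visited


# Stand-in for real downloads: a tiny in-memory "site" (illustrative only).
site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
order = crawl("a", lambda url: site.get(url, []), max_workers=2)
print(order)
```

Because the loop is iterative, the depth of the link graph no longer affects the call stack, and the number of simultaneous fetches is capped by `max_workers` rather than by how many links a page happens to contain.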

Re: web crawling.

2006-01-19 Thread John M. Gabriele
Alex Martelli wrote:
> S Borg <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> I have been writing very simple Python programs that parse HTML and
>> such, mainly just to get
>> a better feel for the language. Here is my question: If I parsed an
>> HTML page into all of the image
>> files listed on tha

Re: web crawling.

2006-01-19 Thread Fuzzyman
Use BeautifulSoup to get all the image tags out of the HTML. You'll need to join the URLs of the images to the URL of the page (urlparse.urljoin, off the top of my head). If you look at BeautifulSoup you will see how to get the 'src' reference of each image tag.

All the best,
Fuzzyman
http://www.
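The two steps described above (collect each image tag's 'src', then join it to the page URL) can be sketched as follows. To stay dependency-free, this sketch swaps BeautifulSoup for the standard library's `html.parser`, and uses `urllib.parse.urljoin`, which is where `urlparse.urljoin` lives in Python 3. The names `ImageSrcCollector` and `image_urls`, the example URL and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collects the raw 'src' attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(page_url, html):
    """Return absolute URLs for all images referenced in html."""
    collector = ImageSrcCollector()
    collector.feed(html)
    # urljoin resolves relative references against the page's own URL
    # and leaves already-absolute URLs unchanged.
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical page URL and markup, for illustration only.
sample_html = (
    '<p><img src="logo.png">'
    '<img src="/static/a.jpg">'
    '<img src="http://other.example/b.gif"></p>'
)
urls = image_urls("http://example.com/gallery/index.html", sample_html)
for url in urls:
    print(url)
```

Each resulting URL can then be requested individually, for example with `urllib.request.urlopen`.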

Re: web crawling.

2006-01-19 Thread gene tani
S Borg wrote:
> Hello,
>
> I have been writing very simple Python programs that parse HTML and
> such, mainly just to get
> a better feel for the language. Here is my question: If I parsed an
> HTML page into all of the image
> files listed on that page, how could I request all of those images an

Re: web crawling.

2006-01-18 Thread Alex Martelli
S Borg <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have been writing very simple Python programs that parse HTML and
> such, mainly just to get
> a better feel for the language. Here is my question: If I parsed an
> HTML page into all of the image
> files listed on that page, how could I request