urllib2 and threading
I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.
Here's the problem: the script simply crashes after getting a couple
of urls and takes a long time to run (slower than a non-threaded
version that I wrote and ran). Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.
The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.
Here's the code:
#!/usr/bin/python
import urllib2
import threading
class MyThread(threading.Thread):
    """Subclass threading.Thread to create Thread instances."""
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        self.func(*self.args)

def get_info_from_url(url):
    """A dummy version of the function: simply visits urls and prints
    the url of the page."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError, e:  # HTTPError subclasses URLError, so test it first
        print " error ", e.code
    except urllib2.URLError, e:
        print " error ", e.reason
    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

ulock = threading.Lock()
num_links = 10
threads = []  # store threads here
urls = []     # store urls here

fh = open("links.txt", "r")
for line in fh:
    urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
    t = MyThread(get_info_from_url, (urls[i],))
    threads.append(t)

# start the threads
for i in range(num_links):
    threads[i].start()
for i in range(num_links):
    threads[i].join()

print "all done"
ulock = threading.Lock()
num_links = 10
threads = [] # store threads here
urls = [] # store urls here
fh = open("links.txt", "r")
for line in fh:
urls.append(line.strip())
fh.close()
# collect threads
for i in range(num_links):
t = MyThread(get_info_from_url, (urls[i],) )
threads.append(t)
# start the threads
for i in range(num_links):
threads[i].start()
for i in range(num_links):
threads[i].join()
print "all done"
--
http://mail.python.org/mailman/listinfo/python-list
Re: urllib2 and threading
Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with urllib2,
the program chokes. If I replace

    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

with

    else:
        pass

urllib2 starts raising URLErrors after the first 3 - 5 urls have been
visited. Do you have any sense of what in the threads is corrupting
urllib2's behavior?

Many thanks,
Robean

On May 1, 12:27 am, Paul Rubin <http://[email protected]> wrote:
> robean writes:
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> > the example shown here is simplified and just confirms the url of the
> > site visited.
>
> Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
> of pages and have multiple cpu's, you probably want parallel processes
> rather than threads.
>
> > wrong? I am new to both threading and urllib2, so it's possible that
> > the SNAFU is quite obvious.
> > ...
> > ulock = threading.Lock()
>
> Without looking at the code for more than a few seconds, using an
> explicit lock like that is generally not a good sign. The usual
> Python style is to send all inter-thread communications through
> Queues. You'd dump all your url's into a queue and have a bunch of
> worker threads getting items off the queue and processing them. This
> really avoids a lot of lock-related headache. The price is that you
> sometimes use more threads than strictly necessary. Unless it's a LOT
> of extra threads, it's usually not worth the hassle of messing with
> locks.
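For readers following along: the worker-pool pattern Paul describes can be
sketched roughly as below. This is written in modern Python 3 terms (the
`queue` module; the thread-per-url scheme from the original post is replaced
by a fixed pool of workers). The `func` argument is a stand-in for whatever
fetch-and-parse routine you'd actually use, such as `urllib.request.urlopen`;
nothing here is the poster's actual code.

```python
import queue
import threading

def run_workers(items, func, num_workers=5):
    """Process items concurrently: a fixed pool of worker threads
    pulls items off a Queue until it is drained."""
    q = queue.Queue()
    for item in items:
        q.put(item)

    results = []
    results_lock = threading.Lock()  # only guards the shared results list

    def worker():
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return               # queue drained; this worker is done
            result = func(item)      # e.g. fetch and parse a url
            with results_lock:
                results.append(result)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Demo with a trivial stand-in for a real fetch function:
urls = ["http://example.com/%d" % i for i in range(10)]
fetched = run_workers(urls, lambda u: u.upper())
print(len(fetched))  # prints 10
```

All items are queued before any worker starts, so each url is processed
exactly once and no explicit lock around the work itself is needed.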
urllib2 and exceptions
Hi everyone,

I have a question about using urllib2. I like urllib2 better than urllib
at least in part because it has more elaborate support for handling
errors: there is built in support for URLError (for faulty urls) and
HTTPError (for http errors that might originate from, say, passing an
invalid stock-ticker in the program below). However I can get neither to
work. I'm attaching below the (very short) code: can anyone point out
what I'm doing wrong?

Now, if I replace the URLError and HTTPError with IOError (the class
from which both URLError and HTTPError inherit), the program works fine.
Why is it that I can call the generic IOError class, but none of the
Error classes derived from that? These are clearly defined in the
urllib2 manual. Very confused...

Here's the code:

import urllib2

# read stock information from yahoo finance for Target (TGT)
goodTicker = 'TGT'     # program works with this
badTicker = 'TGTttt'   # python doesn't understand either HTTPError or URLError with this

url = "http://ichart.finance.yahoo.com/table.csv?s=" + badTicker

try:
    handle = urllib2.urlopen(url)

# this does not work
except HTTPError, e:
    print "There was an http error"
    print e

# this also does not work
except URLError, e:
    print "There is a problem with the URL"
    print e
    exit(1)

# this works
except IOError, e:
    print "You have an IOError"
    print e

text = handle.readlines()[:20]
for line in text:
    print line
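As an aside on why the IOError clause catches these errors at all: an
`except` clause matches the named class and every subclass of it, and the
clauses are tried in order, so the most specific class must come first.
A small sketch (plain Python 3, with hypothetical stand-in classes that
mirror urllib2's hierarchy, not the real ones) illustrates:

```python
class URLError(IOError):
    """Stand-in mirroring urllib2's hierarchy: URLError inherits from IOError."""

class HTTPError(URLError):
    """And HTTPError inherits from URLError."""

def classify(exc):
    """Return the name of the except clause that catches exc.
    Clauses are checked top to bottom; a subclass matches its bases,
    so the most specific class has to be listed first."""
    try:
        raise exc
    except HTTPError:
        return "HTTPError"
    except URLError:
        return "URLError"
    except IOError:
        return "IOError"

print(classify(HTTPError()))  # the most specific clause wins
print(classify(IOError()))    # a plain IOError falls through to the last clause
```

This is also why catching bare `IOError` "works" in the original program:
it silently swallows both derived errors as well.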
Re: urllib2 and exceptions
On Sep 28, 12:11 pm, "Chris Rebert" <[EMAIL PROTECTED]> wrote:
> On Sun, Sep 28, 2008 at 11:03 AM, robean <[EMAIL PROTECTED]> wrote:
> > Hi everyone,
> >
> > I have a question about using urllib2.
> >
> > I like urllib2 better than urllib at least in part because it has more
> > elaborate support for handling errors: there is built in support for
> > URLError (for faulty urls) and HTTPError (for http errors that might
> > originate from, say, passing an invalid stock-ticker in the program
> > below). However I can get neither to work. I'm attaching below the
> > (very short) code: can anyone point out what I'm doing wrong?
> >
> > [original code and question snipped]
>
> My Python begs to differ:
>
> # tmp.py
> import urllib2
>
> badTicker = 'TGTttt'
> url = "http://ichart.finance.yahoo.com/table.csv?s=" + badTicker
>
> try:
>     handle = urllib2.urlopen(url)
> except urllib2.HTTPError, e:
>     print "There was an http error"
>     print e
> except urllib2.URLError, e:
>     print "There is a problem with the URL"
>     print e
> except urllib2.IOError, e:
>     print "You have an IOError"
>     print e
>
> # in the shell
> $ python -V
> Python 2.5.1
> $ python Desktop/tmp.py
> There was an http error
> HTTP Error 404: Not Found
>
> Are you using an outdated version of Python perhaps?
>
> Regards,
> Chris
>
> --
> Follow the path of the Iguana...http://rebertia.com

Then I expect that it is most likely my version of python that is
causing the problem. I'm using 2.5.2.
Re: urllib2 and exceptions
On Sep 28, 12:27 pm, robean <[EMAIL PROTECTED]> wrote:
> On Sep 28, 12:11 pm, "Chris Rebert" <[EMAIL PROTECTED]> wrote:
> > [earlier exchange snipped]
> >
> > Are you using an outdated version of Python perhaps?
>
> Then I expect that it is most likely my version of python that is
> causing the problem. I'm using 2.5.2.

Actually, the problem seems to be that IOError is in my namespace, but
the other error classes are not. So,

    except HTTPError, e:

fails, but

    except urllib2.HTTPError, e:

works fine. Now, I still don't understand why these classes shouldn't
automatically work.
Re: urllib2 and exceptions
On Sep 28, 5:33 pm, alex23 <[EMAIL PROTECTED]> wrote:
> On Sep 29, 5:52 am, robean <[EMAIL PROTECTED]> wrote:
> > Actually, the problem seems to be that IOError is in my namespace, but
> > the other error classes are not. So,
> >
> > except HTTPError, etc.
> >
> > fails, but
> >
> > except urllib2.HTTPError, etc.
> >
> > works fine. Now, I still don't understand why these classes shouldn't
> > automatically work
>
> IOError is a standard Python exception. HTTPError & URLError are
> exceptions provided by the urllib2 module. They need to be imported
> from or referenced through urllib2 to be used.

Many thanks for your reply. I was simply under the impression that
'import urllib2' would take care of the namespace issue and simply
import everything in urllib2, making it unnecessary to have to
reference HTTPError and URLError. Sorry for being dense about this
(I'm very new to Python).

Again, thanks for your help.
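The distinction alex23 describes can be shown in a few lines (here with
Python 3 and the standard `math` module, purely as an illustration):
`import module` binds only the module name itself, while `from module
import name` binds the inner name directly.

```python
import math

# 'import math' binds only the name 'math'; its contents stay inside it.
print(math.pi)              # qualified access works

try:
    pi                      # the bare name was never bound
except NameError:
    print("bare 'pi' is not defined")

# Explicitly importing a name binds it directly in this namespace:
from math import pi
print(pi)                   # now the bare name works too
```

The same rule applies to urllib2: after `import urllib2` you must write
`urllib2.HTTPError`, or else do `from urllib2 import HTTPError, URLError`.
`IOError`, being a builtin, is always available without any import.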
Professional quality scripts/code
I have been learning Python for the last 3 months or so and I have a
working (but somewhat patchy) sense of the language. I've been using a
couple of the more popular Python books as well as online resources.

A question for experienced Python programmers: can you recommend
resources where I can look at high quality Python code and scripts?
I've spent some time at http://code.activestate.com/recipes/ but am
concerned that the quality of what is posted there can be somewhat hit
and miss.

What I have in mind is a site like CPAN, where one can look at the
actual source code of many of the modules and learn a thing or two
about idiomatic Perl programming from studying the better ones. Any
sites like that for Python? (You can of course look up Python modules
on docs.python.org, but, as far as I can tell, not the actual source
code.)

Many thanks.
Re: Professional quality scripts/code
On Oct 3, 1:26 am, Bruno Desthuilliers wrote:
> robean wrote:
> > I have been learning Python for the last 3 months or so and I have a
> > working (but somewhat patchy) sense of the language. I've been
> > using a couple of the more popular Python books as well as online
> > resources.
> >
> > A question for experienced Python programmers: can you recommend
> > resources where I can look at high quality Python code and scripts?
>
> Well... Not everything is 'high quality' in it [1], but why not start
> with the stdlib? Most of it is pure Python, opensource code and is
> already installed on your machine, isn't it ?-)
>
> [1] IIRC, last time I had a look at the zipfile module's code, it was
> more of a Q&D hack than anything else - now as long as it works fine for
> what I do with it and I don't have to maintain it, well, that's fine.
>
> > I've spent some time at http://code.activestate.com/recipes/ but am
> > concerned that the quality of what is posted there can be somewhat hit
> > and miss.
>
> Indeed.
>
> > What I have in mind is a site like CPAN, where one can look
> > at the actual source code of many of the modules and learn a thing or
> > two about idiomatic Perl programming from studying the better ones.
> > Any sites like that for Python?
>
> Lurking here is probably a good way to see a lot of code reviews. And
> even possibly to submit snippets of your own code to review. Some (if
> not most) of us here like to show how good we are at improving the poor
> newbies' code !-)

Many thanks, Mike and Bruno. The resources you mention are exactly the
kind of stuff I was looking for. Soon enough I hope to give all of you
many chances to improve this poor newbie's code...!

- Robean
