urllib2 and threading

2009-04-30 Thread robean
I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.

Here's the problem: the script simply crashes after getting a couple
of urls, and it takes a long time to run (slower than a non-threaded
version that I wrote and ran). Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.

The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.

Here's the code:

#!/usr/bin/python

import urllib2
import threading

class MyThread(threading.Thread):
  """Subclass threading.Thread to create Thread instances."""
  def __init__(self, func, args):
    threading.Thread.__init__(self)
    self.func = func
    self.args = args

  def run(self):
    self.func(*self.args)   # apply() is deprecated; unpack the args directly


def get_info_from_url(url):
  """A dummy version of the function: simply visits urls and prints
  the url of the page."""
  try:
    page = urllib2.urlopen(url)
  # HTTPError is a subclass of URLError, so it must be caught first;
  # the other way around, the HTTPError clause would never be reached
  except urllib2.HTTPError, e:
    print " error ", e.code
  except urllib2.URLError, e:
    print " error ", e.reason
  else:
    ulock.acquire()
    print page.geturl()   # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

ulock = threading.Lock()
num_links = 10
threads = [] # store threads here
urls = [] # store urls here

fh = open("links.txt", "r")
for line in fh:
  urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
  t = MyThread(get_info_from_url, (urls[i],) )
  threads.append(t)

# start the threads
for i in range(num_links):
  threads[i].start()

for i in range(num_links):
  threads[i].join()

print "all done"



Re: urllib2 and threading

2009-05-01 Thread robean
Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace

  else:
    ulock.acquire()
    print page.geturl()   # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

with

  else:
pass

urllib2 starts raising URLErrors after the first 3-5 urls have
been visited. Do you have any sense of what in the threads is corrupting
urllib2's behavior?  Many thanks,

Robean



On May 1, 12:27 am, Paul Rubin <http://[email protected]> wrote:
> robean  writes:
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> > the example shown here is simplified and just confirms the url of the
> > site visited.
>
> Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
> of pages and have multiple CPUs, you probably want parallel processes
> rather than threads.
>
> > wrong? I am new to both threading and urllib2, so it's possible that
> > the SNAFU is quite obvious.
> > ...
> > ulock = threading.Lock()
>
> Without looking at the code for more than a few seconds, using an
> explicit lock like that is generally not a good sign.  The usual
> Python style is to send all inter-thread communications through
> Queues.  You'd dump all your urls into a queue and have a bunch of
> worker threads getting items off the queue and processing them.  This
> really avoids a lot of lock-related headaches.  The price is that you
> sometimes use more threads than strictly necessary.  Unless it's a LOT
> of extra threads, it's usually not worth the hassle of messing with
> locks.
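
A minimal sketch of the queue-based worker pattern Paul describes, in
the Python 2 idiom of the thread (the file name, worker count, and
handler bodies are illustrative, not taken from the original posts):

import urllib2
import threading
import Queue

NUM_WORKERS = 10   # bounds how many connections are open at any one time

def worker(url_queue):
    """Pull urls off the shared queue until it is empty."""
    while True:
        try:
            url = url_queue.get_nowait()
        except Queue.Empty:
            return   # no more work to do
        try:
            page = urllib2.urlopen(url)
            print page.geturl()   # scraping/parsing would go here
            page.close()
        except urllib2.HTTPError, e:   # a subclass of URLError, so catch it first
            print "http error", e.code
        except urllib2.URLError, e:
            print "url error", e.reason

url_queue = Queue.Queue()
for line in open("links.txt"):
    url_queue.put(line.strip())

workers = [threading.Thread(target=worker, args=(url_queue,))
           for i in range(NUM_WORKERS)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print "all done"

Because only NUM_WORKERS sockets are ever open at once, a pool like
this may also avoid the resource exhaustion that starting one thread
per url can cause, which is one plausible explanation for the original
script choking after a handful of urls.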



urllib2 and exceptions

2008-09-28 Thread robean
Hi everyone,

I have a question about using urllib2.

I like urllib2 better than urllib, at least in part because it has more
elaborate support for handling errors: there is built-in support for
URLError (for faulty urls) and HTTPError (for http errors that might
originate from, say, passing an invalid stock ticker in the program
below).  However, I can get neither to work.  I'm attaching below the
(very short) code: can anyone point out what I'm doing wrong?

Now, if I replace the URLError and HTTPError with IOError (the class
from which both URLError and HTTPError inherit), the program works
fine. Why is it that I can catch the generic IOError class, but none of
the Error classes derived from it? These are clearly defined in the
urllib2 manual. Very confused...

Here's the code:


import urllib2

# read stock information from yahoo finance for Target (TGT)
goodTicker = 'TGT'     # program works with this
badTicker = 'TGTttt'   # python doesn't understand either HTTPError or URLError with this

url = "http://ichart.finance.yahoo.com/table.csv?s=" + badTicker

try:
    handle = urllib2.urlopen(url)

# this does not work
except HTTPError, e:
    print "There was an http error"
    print e

# this also does not work
except URLError, e:
    print "There is a problem with the URL"
    print e
    exit(1)

# this works
except IOError, e:
    print "You have an IOError"
    print e

text = handle.readlines()[:20]
for line in text:
    print line



Re: urllib2 and exceptions

2008-09-28 Thread robean
On Sep 28, 12:11 pm, "Chris Rebert" <[EMAIL PROTECTED]> wrote:
> On Sun, Sep 28, 2008 at 11:03 AM, robean <[EMAIL PROTECTED]> wrote:
> > [full question and code snipped; quoted in the original message above]
> >
> > Now, if I replace the URLError and HTTPError with IOError (the class
> > from which both URLError and HTTPError inherit), the program works
> > fine. Why is it that I can catch the generic IOError class, but none of
> > the Error classes derived from it? These are clearly defined in the
> > urllib2 manual. Very confused...
>
> My Python begs to differ:
>
> #tmp.py
> import urllib2
>
> badTicker = 'TGTttt'
> url = "http://ichart.finance.yahoo.com/table.csv?s=" + badTicker
>
> try:
>     handle = urllib2.urlopen(url)
>
> except urllib2.HTTPError, e:
>     print "There was an http error"
>     print e
>
> except urllib2.URLError, e:
>     print "There is a problem with the URL"
>     print e
>
> except IOError, e:
>     print "You have an IOError"
>     print e
>
> #in the shell
> $ python -V
> Python 2.5.1
> $ python Desktop/tmp.py
> There was an http error
> HTTP Error 404: Not Found
>
> Are you using an outdated version of Python perhaps?
>
> Regards,
> Chris
>
> --
> Follow the path of the Iguana... http://rebertia.com

Then I expect that it is most likely my version of Python that is
causing the problem. I'm using 2.5.2.


Re: urllib2 and exceptions

2008-09-28 Thread robean
On Sep 28, 12:27 pm, robean <[EMAIL PROTECTED]> wrote:
> On Sep 28, 12:11 pm, "Chris Rebert" <[EMAIL PROTECTED]> wrote:
> [earlier question, code, and Chris's working version snipped; see above]
>
> > Are you using an outdated version of Python perhaps?
>
> Then I expect that it is most likely my version of Python that is
> causing the problem. I'm using 2.5.2.

Actually, the problem seems to be that IOError is in my namespace, but
the other error classes are not. So,

   except HTTPError, etc.

fails, but

   except urllib2.HTTPError, etc.

works fine. Now, I still don't understand why these classes don't
automatically work.


Re: urllib2 and exceptions

2008-09-28 Thread robean
On Sep 28, 5:33 pm, alex23 <[EMAIL PROTECTED]> wrote:
> On Sep 29, 5:52 am, robean <[EMAIL PROTECTED]> wrote:
>
> > Actually, the problem seems to be that IOError is in my namespace, but
> > the other error classes are not. So,
>
> >    except HTTPError, etc.
>
> > fails, but
>
> >    except urllib2.HTTPError, etc.
>
> > works fine. Now, I still don't understand why these classes don't
> > automatically work.
>
> IOError is a standard Python exception. HTTPError & URLError are
> exceptions provided by the urllib2 module. They need to be imported
> from or referenced through urllib2 to be used.

Many thanks for your reply. I was simply under the impression that
'import urllib2' would take care of the namespace issue and import
everything in urllib2, making it unnecessary to qualify HTTPError and
URLError.  Sorry for being dense about this (I'm very new to
Python). Again, thanks for your help.
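
To make the namespace point concrete, here is a minimal sketch (the
URL is illustrative); either the qualified names or an explicit
from-import works:

import urllib2
from urllib2 import HTTPError, URLError   # bring the names into this namespace

try:
    handle = urllib2.urlopen("http://example.invalid/")
except HTTPError, e:   # resolves, thanks to the explicit from-import
    print "http error", e.code
except URLError, e:    # urllib2.URLError would work just as well
    print "url error", e.reason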


Professional quality scripts/code

2008-10-02 Thread robean
I have been learning Python for the last 3 months or so and I have a
working (but somewhat patchy) sense of the language. I've been
using a couple of the more popular Python books as well as online
resources.

A question for experienced Python programmers: can you recommend
resources where I can look at high-quality Python code and scripts?
I've spent some time at http://code.activestate.com/recipes/ but am
concerned that the quality of what is posted there can be somewhat hit
and miss.  What I have in mind is a site like CPAN, where one can look
at the actual source code of many of the modules and learn a thing or
two about idiomatic Perl programming from studying the better ones.
Any sites like that for Python? (You can of course look up Python
modules on docs.python.org but, as far as I can tell, not their actual
source code.) Many thanks.
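
One direct way to do this: the source of any pure-Python module ships
with the interpreter and can be located from the interactive prompt. A
minimal sketch (the module chosen is just an example):

import inspect
import urllib2

# path of the .py file that implements the module,
# e.g. something like /usr/lib/python2.5/urllib2.py
print inspect.getsourcefile(urllib2)

# or read a particular function's implementation directly
print inspect.getsource(urllib2.urlopen)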






Re: Professional quality scripts/code

2008-10-04 Thread robean
On Oct 3, 1:26 am, Bruno Desthuilliers wrote:
> robean wrote:
>
> > I have been learning Python for the last 3 months or so and I have a
> > working (but somewhat patchy) sense of the language. I've been
> > using a couple of the more popular Python books as well as online
> > resources.
>
> > A question for experienced Python programmers: can you recommend
> > resources where I can look at high quality Python code and scripts?
>
> Well... Not everything is 'high quality' in it[1], but why not start
> with the stdlib? Most of it is pure Python, open-source code and is
> already installed on your machine, isn't it ?-)
>
> [1] IIRC, last time I had a look at the zipfile module's code, it was
> more of a Q&D (quick-and-dirty) hack than anything else - now, as long
> as it works fine for what I do with it and I don't have to maintain
> it, well, that's fine.
>
> > I've spent some time at http://code.activestate.com/recipes/ but am
> > concerned that the quality of what is posted there can be somewhat hit
> > and miss.
>
> Indeed.
>
> >  What I have in mind is a site like CPAN, where one can look
> > at the actual source code of many of the modules and learn a thing or
> > two about idiomatic Perl programming from studying the better ones.
> > Any sites like that for Python?
>
> Lurking here is probably a good way to see a lot of code reviews. And
> even possibly to submit snippets of your own code for review. Some (if
> not most) of us here like to show how good we are at improving the poor
> newbies' code !-)

Many thanks, Mike and Bruno,

The resources you mention are exactly the kind of stuff I was looking
for. Soon enough I hope to give all of you many chances to improve
this poor newbie's code...!

- Robean
