Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Andre Engels
On Fri, Jan 23, 2009 at 11:25 AM, Andre Engels wrote:
> On Fri, Jan 23, 2009 at 10:37 AM, amit sethi wrote:
>> so is there a way around that problem ??
>
> Ok, I have done some checking around, and it seems that the Wikipedia
> server is giving a return code of 403 (forbidden), but still givin

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Alan Gauld
"Kent Johnson" wrote Rather than editing the existing code and making it non standard why not subclass robotparser: That won't work, it is urllib.URLOpener() that he is patching and Sorry, yes I misread that post as modifying robotparser, it should have been URLOpener. But... robotparse

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Kent Johnson
On Fri, Jan 23, 2009 at 5:37 AM, Andre Engels wrote:
> Looking further I found that a 'cleaner' way to make the same change
> is to add to the code of URLopener (outside any method):
>
> version = ''
You can do this without modifying the standard library source, by
import urllib
urllib.URLop
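
A minimal sketch of the same idea in Python 3, assuming the goal is only to change the User-Agent that urllib sends without touching the standard library source. Instead of the Python 2 URLopener class discussed in the thread, it installs a global opener. The helper name make_opener and the agent string MyWikiReader/0.1 are hypothetical, chosen for illustration.

```python
import urllib.request


def make_opener(user_agent):
    """Return an opener whose default User-Agent header is `user_agent`."""
    opener = urllib.request.build_opener()
    # Drop any existing User-Agent entry, then add the replacement.
    opener.addheaders = [
        (name, value) for name, value in opener.addheaders
        if name.lower() != "user-agent"
    ] + [("User-Agent", user_agent)]
    return opener


# Hypothetical agent string; pick one that identifies your script.
opener = make_opener("MyWikiReader/0.1")
# After this call, urllib.request.urlopen() goes through the opener above.
urllib.request.install_opener(opener)
```

Since urllib.robotparser fetches robots.txt through urllib.request.urlopen(), it should pick up the installed opener as well, so no library file needs editing.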

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Kent Johnson
On Fri, Jan 23, 2009 at 6:23 AM, Alan Gauld wrote:
> Rather than editing the existing code and making it non standard
> why not subclass robotparser:
>
> class WP_RobotParser(robotparser):
>     def __init__(self, *args, **kwargs):
>         robotparser.__init__(self, *args, **kwargs)
>         self.
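
A hedged sketch of how the subclassing idea could look in Python 3, assuming the aim is to have the parser itself fetch robots.txt with a chosen User-Agent. The class name UserAgentRobotParser, the user_agent parameter, and the default agent string are hypothetical; the error handling copies the policy of the standard library's read() method rather than anything posted in the thread.

```python
import urllib.error
import urllib.request
import urllib.robotparser


class UserAgentRobotParser(urllib.robotparser.RobotFileParser):
    """RobotFileParser that fetches robots.txt with a chosen User-Agent."""

    def __init__(self, url="", user_agent="MyWikiReader/0.1"):
        super().__init__(url)
        self.user_agent = user_agent  # hypothetical default agent string

    def read(self):
        """Fetch self.url with the custom header and parse the result."""
        request = urllib.request.Request(
            self.url, headers={"User-Agent": self.user_agent}
        )
        try:
            with urllib.request.urlopen(request) as response:
                raw = response.read()
        except urllib.error.HTTPError as err:
            # Same policy as the standard library: 401/403 block everything,
            # other 4xx responses allow everything.
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(raw.decode("utf-8").splitlines())
```

To use it, create the parser with the robots.txt URL, call read(), and then call can_fetch() as before. Only the request for robots.txt changes; the rule matching is inherited unchanged.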

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Alan Gauld
"Andre Engels" wrote developers of Wikimedia why this is done, but for now you can resolve this by editing robotparser.py in the following way: In the __init__ of the class URLopener, add the following at the end: self.addheaders = [header for header in self.addheaders if header[0] != "Us

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Andre Engels
On Fri, Jan 23, 2009 at 12:07 PM, amit sethi wrote:
> well thanks ... it worked well ... but robotparser is in urllib isn't there
> a module like robotparser in
> urllib2
You'll have to ask someone else about that part...
--
André Engels, andreeng...@gmail.com

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread amit sethi
well thanks ... it worked well ... but robotparser is in urllib isn't there a module like robotparser in urllib2
On Fri, Jan 23, 2009 at 3:55 PM, Andre Engels wrote:
> On Fri, Jan 23, 2009 at 10:37 AM, amit sethi wrote:
> > so is there a way around that problem ??
>
> Ok, I have done some che

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Andre Engels
On Fri, Jan 23, 2009 at 11:25 AM, Andre Engels wrote:
> In the __init__ of the class URLopener, add the following at the end:
>
> self.addheaders = [header for header in self.addheaders if header[0]
> != "User-Agent"] + [('User-Agent', '')]
>
> (probably
>
> self.addheaders = [('User-Agent', '')]

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Andre Engels
On Fri, Jan 23, 2009 at 10:37 AM, amit sethi wrote:
> so is there a way around that problem ??
Ok, I have done some checking around, and it seems that the Wikipedia server is giving a return code of 403 (forbidden), but still giving the page - which I think is weird behaviour. I will check with t
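
A small sketch of why can_fetch() came back False for every user agent, assuming the standard library behaviour that a 401 or 403 response for robots.txt marks the whole site as disallowed. The flag is set by hand here to imitate that response without any network access. The BadBot rules are made up for comparison and are not Wikipedia's real robots.txt.

```python
import urllib.robotparser

url = "http://en.wikipedia.org/wiki/Sachin_Tendulkar"

# Imitate what read() does after a 403 on robots.txt: block everything.
blocked = urllib.robotparser.RobotFileParser()
blocked.disallow_all = True
print(blocked.can_fetch("Mozilla/5.0", url))  # False, whatever the agent

# For comparison, rules that were actually parsed are matched per agent.
# These rules are hypothetical, for illustration only.
parsed = urllib.robotparser.RobotFileParser()
parsed.parse([
    "User-agent: BadBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /w/",
])
print(parsed.can_fetch("Mozilla/5.0", url))  # True
print(parsed.can_fetch("BadBot", url))       # False
```

This is consistent with the observation in the thread: the agent string passed to can_fetch() only matters once robots.txt has been retrieved and parsed, so the fix has to change the header sent when fetching robots.txt itself.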

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread amit sethi
so is there a way around that problem ??
On Fri, Jan 23, 2009 at 2:25 PM, Andre Engels wrote:
> On Fri, Jan 23, 2009 at 9:09 AM, amit sethi wrote:
> > Well that is interesting but why should that happen in case I am using a
> > different User Agent because I tried doing
> > status=rp.can_fet

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread Andre Engels
On Fri, Jan 23, 2009 at 9:09 AM, amit sethi wrote:
> Well that is interesting but why should that happen in case I am using a
> different User Agent because I tried doing
> status=rp.can_fetch('Mozilla/5.0',
> "http://en.wikipedia.org/wiki/Sachin_Tendulkar")
> but even that returns false
> Is th

Re: [Tutor] fetching wikipedia articles

2009-01-23 Thread amit sethi
Well that is interesting but why should that happen in case I am using a different User Agent because I tried doing
status=rp.can_fetch('Mozilla/5.0', "http://en.wikipedia.org/wiki/Sachin_Tendulkar")
but even that returns false
Is there something wrong with the syntax , Is there a catch that i d

Re: [Tutor] fetching wikipedia articles

2009-01-22 Thread Andre Engels
On Thu, Jan 22, 2009 at 6:08 PM, amit sethi wrote:
> hi , I need help as to how i can fetch a wikipedia article i tried changing
> my user agent but it did not work . Although as far as my knowledge of
> robots.txt goes , looking at en.wikipedia.org/robots.txt it does not seem it
> should block a