karl added the comment:
→ python
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
>>> rp.read()
>>>
Let's check the server logs:
127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 "-" "Python-urllib/1.17"
In 2.*, robotparser used the Python-urllib/1.17 user agent by default, which
many sysadmins traditionally block. A solution has already been proposed
above:
This is the proposed test for 3.4:
import urllib.robotparser
import urllib.request
# Install a global opener so that every urlopen() call, including the one
# inside RobotFileParser.read(), sends the custom User-agent.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)
rp = urllib.robotparser.RobotFileParser('http://localhost:9999')
rp.read()
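The test above changes the User-agent for the whole process through
install_opener(). If that is too broad, one possible per-request alternative
is to fetch robots.txt yourself and hand the lines to RobotFileParser.parse().
This is only a sketch: the helper names build_request and load_robots and the
timeout value are made up for illustration, and it skips the HTTP error
handling that read() performs.

```python
import urllib.request
import urllib.robotparser


def build_request(url, user_agent):
    """Build a request carrying an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def load_robots(url, user_agent="MyUa/0.1", timeout=10):
    """Fetch robots.txt with a custom User-Agent and return a parser (hypothetical helper).

    Only this one request is affected; no global opener is installed.
    HTTP errors are not translated into allow/disallow rules here; they
    propagate as exceptions from urlopen().
    """
    rp = urllib.robotparser.RobotFileParser(url)
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # parse() accepts an iterable of lines, so no second fetch is needed.
    rp.parse(body.splitlines())
    return rp
```

Usage would look like rp = load_robots('http://localhost:9999/robots.txt')
followed by rp.can_fetch('MyUa/0.1', some_url).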
The issue is no longer about changing the lib, only about documenting how to
change the RobotFileParser default UA. We can change the title of this issue
if it's confusing, or close it and open a new one for the documentation,
whichever is easier :)
Currently robotparser.py uses the urllib user agent:
http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364
This is a common failure when using urllib in general, robotparser included.
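To see which user agent a given interpreter will send by default, one can
inspect the headers of a freshly built opener. This is a quick check on
Python 3, not part of the proposed patch; the variable name default_headers is
only for illustration.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds the
# default headers sent with each request made through that opener.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value has the form "Python-urllib/<major>.<minor>" and depends on the
# interpreter version.
print(default_headers["User-agent"])
```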
As for Wikipedia, they fixed their server-side user agent sniffing and no
longer filter Python-urllib:
GET /robots.txt HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, compress
Host: en.wikipedia.org
User-Agent: Python-urllib/1.17
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3161
Cache-control: s-maxage=3600, must-revalidate, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 5208
Content-Type: text/plain; charset=utf-8
Date: Sun, 22 Jun 2014 23:59:16 GMT
Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
Server: Apache
Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
Vary: X-Subdomain
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Article-ID: 19292575
X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
X-Content-Type-Options: nosniff
X-Language: en
X-Site: wikipedia
X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894
Many other sites still do. :)
----------
versions: +Python 3.4 -Python 3.5
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue15851>
_______________________________________