Fred G wrote:

With the exception of step 6, I'm not quite sure how to do this in Python.
Is it very complicated to write a script that logs onto a website that
requires a user name and password that I have, and then repeatedly enters
names and gets the associated IDs that we want?


Python comes with some libraries for downloading web resources, including web pages. But if you have to interact with the web page, such as entering names into a search field, your best bet is the third-party library mechanize. I have never used it, but I have never heard anything but good things about it.

http://pypi.python.org/pypi/mechanize/

http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/index.html
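
As a rough, untested sketch of what your question might look like with mechanize (I have never used it, so this is pieced together from its documentation; the URL, form positions and field names below are invented and will need to be adjusted to match the real site):

    import mechanize

    br = mechanize.Browser()

    # Log in: open the login page and fill in the first form on it.
    # mechanize keeps the session cookies, so later requests stay logged in.
    br.open("https://example.com/login")       # made-up URL
    br.select_form(nr=0)
    br["username"] = "my_user_name"            # made-up field names
    br["password"] = "my_password"
    br.submit()

    # Repeatedly enter names and collect the ID from each result page.
    for name in ["Smith", "Jones", "Garcia"]:
        br.open("https://example.com/search")  # made-up URL
        br.select_form(nr=0)
        br["query"] = name                     # made-up search field
        response = br.submit()
        html = response.read()
        # parse html (e.g. with BeautifulSoup) to pull out the ID you want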


I used to work at a
cancer lab where we decided we couldn't do this kind of thing to search
PubMed, and that a human would be more accurate even though our criterion
was simply (is there survival data?).  I don't think that this has to be
the case here, but would greatly appreciate any guidance.


In general, web-scraping is fraught with problems. Web sites are written by ignorant code-monkeys who can barely spell HTML, or worse, by too-clever-by-half web designers whose subtly broken code only works with Internet Explorer (and, if you are lucky, Firefox). Or they stick everything in JavaScript, or worse, Flash.

And often the web server tries to prevent automated tools from fetching information. Or there may be legal barriers, where something which is perfectly legal if *you* do it becomes (allegedly) illegal if an automated script does it.

So I can perfectly understand why a conservative, risk-averse university might prefer to have a human being mechanically fetch the data.

And yet, with work it is possible to code around nearly all these issues. Using tools like mechanize and BeautifulSoup, faking the user-agent string, and a few other techniques, most non-Flash non-Javascript sites can be successfully web-scraped. Even the legal issue can be coded around by adding some human interaction to the script, so that it is not an *automated* script, while still keeping most of the benefits of automated scraping.
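
For instance, faking the user-agent string and keeping a human in the loop might look something like this sketch (Python 2 style, to match mechanize at the time of writing; the browser string and URL are just placeholders):

    import urllib
    import mechanize

    br = mechanize.Browser()
    # Pretend to be an ordinary browser rather than a Python script.
    br.addheaders = [("User-Agent",
        "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/10.0")]

    for name in ["Smith", "Jones"]:
        # A human confirms each request, so the script is not fully
        # automated -- it just takes care of the tedious parts.
        raw_input("Press Enter to look up %r: " % name)
        response = br.open(
            "https://example.com/search?q=" + urllib.quote(name))
        html = response.read()
        # hand html off to BeautifulSoup, write it to a file, etc.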

Don't abuse the privilege:

- obey robots.txt
- obey the site's terms and conditions
- obey copyright law
- make a temporary cache of pages you need to re-visit[1]
- give real human visitors priority
- limit your download rate to something reasonable
- pause between requests so you aren't hitting the server at an
  unreasonable rate (see the sketch after this list)
- in general, don't be a dick and disrupt the normal working of the
  server or website with your script.
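
A tiny fetch helper illustrating the cache-and-pause points might look like this (the cache directory name and the two-second delay are arbitrary choices, not magic values):

    import os
    import time
    import hashlib
    import urllib2

    CACHE_DIR = "page_cache"

    def fetch(url, delay=2.0):
        """Return the page at url, downloading each page at most once
        and pausing before every real request."""
        if not os.path.isdir(CACHE_DIR):
            os.makedirs(CACHE_DIR)
        cache_file = os.path.join(CACHE_DIR, hashlib.md5(url).hexdigest())
        if os.path.exists(cache_file):
            return open(cache_file, "rb").read()
        time.sleep(delay)  # don't hammer the server
        html = urllib2.urlopen(url).read()
        open(cache_file, "wb").write(html)
        return html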




[1] I am aware of the irony that this is theoretically forbidden by copyright. Nevertheless, it is the right thing to do, both technically and ethically.



--
Steven
