Fred G wrote:

With the exception of step 6, I'm not quite sure how to do this in Python.
Is it very complicated to write a script that logs onto a website that
requires a user name and password that I have, and then repeatedly enters
names and gets the associated IDs that we want?


Python comes with some libraries for downloading web resources, including web pages. But if you have to interact with the web page, such as entering names into a search field, your best bet is the third-party library mechanize. I have never used it, but I have never heard anything but good things about it.

http://pypi.python.org/pypi/mechanize/

http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/index.html
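
As a rough, untested sketch of what your question might look like with mechanize (I have never used it, so this is pieced together from its documentation; the URL, form positions and field names below are invented and will need to be adjusted to match the real site):

    import mechanize

    br = mechanize.Browser()

    # Log in: open the login page and fill in the first form on it.
    # mechanize keeps the session cookies, so later requests stay logged in.
    br.open("https://example.com/login")       # made-up URL
    br.select_form(nr=0)
    br["username"] = "my_user_name"            # made-up field names
    br["password"] = "my_password"
    br.submit()

    # Repeatedly enter names and collect the ID from each result page.
    for name in ["Smith", "Jones", "Garcia"]:
        br.open("https://example.com/search")  # made-up URL
        br.select_form(nr=0)
        br["query"] = name                     # made-up search field
        response = br.submit()
        html = response.read()
        # parse html (e.g. with BeautifulSoup) to pull out the ID you want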


I used to work at a
cancer lab where we decided we couldn't do this kind of thing to search
PubMed, and that a human would be more accurate even though our criterion
was simply (is there survival data?).  I don't think that this has to be
the case here, but would greatly appreciate any guidance.


In general, web-scraping is fraught with problems. Web sites are written by ignorant code-monkeys who can barely spell HTML, or worse, by too-clever-by-half web designers whose subtly broken code only works with Internet Explorer (and, if you are lucky, Firefox). Or they stick everything in JavaScript, or worse, Flash.

And often the web server tries to prevent automated tools from fetching information. Or there may be legal barriers, where something which is perfectly legal if *you* do it becomes (allegedly) illegal if an automated script does it.

So I can perfectly understand why a conservative, risk-averse university might prefer to have a human being mechanically fetch the data.

And yet, with work it is possible to code around nearly all these issues. Using tools like mechanize and BeautifulSoup, faking the user-agent string, and a few other techniques, most non-Flash non-Javascript sites can be successfully web-scraped. Even the legal issue can be coded around by adding some human interaction to the script, so that it is not an *automated* script, while still keeping most of the benefits of automated scraping.
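
For instance, faking the user-agent string and keeping a human in the loop might look something like this sketch (Python 2 style, to match mechanize at the time of writing; the browser string and URL are just placeholders):

    import urllib
    import mechanize

    br = mechanize.Browser()
    # Pretend to be an ordinary browser rather than a Python script.
    br.addheaders = [("User-Agent",
        "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/10.0")]

    for name in ["Smith", "Jones"]:
        # A human confirms each request, so the script is not fully
        # automated -- it just takes care of the tedious parts.
        raw_input("Press Enter to look up %r: " % name)
        response = br.open(
            "https://example.com/search?q=" + urllib.quote(name))
        html = response.read()
        # hand html off to BeautifulSoup, write it to a file, etc.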

Don't abuse the privilege:

- obey robots.txt
- obey the site's terms and conditions
- obey copyright law
- make a temporary cache of pages you need to re-visit[1]
- give real human visitors priority
- limit your download rate to something reasonable
- pause between requests so you aren't hitting the server at an
  unreasonable rate (see the sketch after this list)
- in general, don't be a dick and disrupt the normal working of the
  server or website with your script.
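
A tiny fetch helper illustrating the cache-and-pause points might look like this (the cache directory name and the two-second delay are arbitrary choices, not magic values):

    import os
    import time
    import hashlib
    import urllib2

    CACHE_DIR = "page_cache"

    def fetch(url, delay=2.0):
        """Return the page at url, downloading each page at most once
        and pausing before every real request."""
        if not os.path.isdir(CACHE_DIR):
            os.makedirs(CACHE_DIR)
        cache_file = os.path.join(CACHE_DIR, hashlib.md5(url).hexdigest())
        if os.path.exists(cache_file):
            return open(cache_file, "rb").read()
        time.sleep(delay)  # don't hammer the server
        html = urllib2.urlopen(url).read()
        open(cache_file, "wb").write(html)
        return html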




[1] I am aware of the irony that this is theoretically forbidden by copyright. Nevertheless, it is the right thing to do, both technically and ethically.



--
Steven
