On Wed, May 18, 2011 at 2:06 AM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
> Hello,
>
> How can I walk (as in os.walk) or glob a website? I want to download all
> the pdfs from a website (using urllib.urlretrieve), extract certain figures
> (using pypdf - is this flexible enough?) and make some statistics/graphs from
> those figures (using rpy and R). I forgot what the process of 'automatically
> downloading' is called again, something that sounds like 'whacking' (??)

I think the word you're looking for is "scraping". I actually did something (roughly) similar a few years ago, to download a collection of free Russian audiobooks for my father-in-law (an avid reader who was quickly going blind). I crawled the site looking for .mp3 files, then returned a tree from which I could select files to be downloaded. It's horribly crude, in retrospect, and I'm embarrassed re-reading my code - but if you're interested I can forward it (if only as an example of what _not_ to do).
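For what it's worth, the core of that kind of scraper is just "fetch a page, pull out the links you care about". A minimal sketch of the link-extraction step using only the standard library might look like this (the URL and HTML here are made up for illustration; in practice you'd fetch the page with urllib and then pass each collected link to urlretrieve):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PDFLinkParser(HTMLParser):
    """Collect absolute URLs of .pdf links found in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # Look at every <a href="..."> and keep the ones ending in .pdf,
        # resolving relative paths against the page's own URL.
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.lower().endswith(".pdf"):
                self.pdf_links.append(urljoin(self.base_url, href))


# Hypothetical page content -- normally you'd get this from
# urllib.request.urlopen(base_url).read().decode()
html = ('<a href="report.pdf">Report</a> '
        '<a href="/docs/fig2.pdf">Figure 2</a> '
        '<a href="index.html">Home</a>')

parser = PDFLinkParser("http://example.com/page/")
parser.feed(html)
print(parser.pdf_links)
# -> ['http://example.com/page/report.pdf', 'http://example.com/docs/fig2.pdf']
```

From there, downloading is one call per link, e.g. urllib.request.urlretrieve(url, filename). A real crawler would also need to follow links to sub-pages and keep a "seen" set to avoid loops, but the pattern above is the heart of it.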
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor