On Wed, 18 May 2011 07:06:07 pm Albert-Jan Roskam wrote:
> Hello,
>
> How can I walk (as in os.walk) or glob a website?
If you're on Linux, use wget or curl. If you're on a Mac, you can
probably install them using MacPorts. If you're on Windows, you have
my sympathies. *wink*

> I want to download all the pdfs from a website (using
> urllib.urlretrieve),

This first part essentially duplicates wget or curl. The basic
algorithm is:

- download a web page
- analyze that page for links (such as <a href=...>, but possibly
  others as well)
- decide whether you should follow each link and download that page
- repeat until there's nothing left to download, the website blocks
  your IP address, or you've got everything you want

except that wget and curl already do 90% of that work. (A rough,
untested sketch of the same loop in Python is at the end of this
message.)

If the web page requires Javascript to make things work, wget and
curl can't help. I believe there is a Python library called Mechanize
to help with that.

For dealing with real-world HTML (also known as "broken" or
"completely f***ed" HTML, please excuse the self-censorship), the
BeautifulSoup library may be useful.

Before doing any mass downloading, please read this:

http://lethain.com/an-introduction-to-compassionate-screenscraping/

> extract certain figures (using pypdf- is this flexible enough?) and
> make some statistics/graphs from those figures (using rpy and R). I
> forgot what the process of 'automatically downloading' is called
> again, something that sounds like 'whacking' (??)

Sometimes called screen scraping or web scraping, recursive
downloading, or copyright infringement *wink*

http://en.wikipedia.org/wiki/Web_scraping

-- 
Steven D'Aprano
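
Here is the rough sketch of the crawl loop promised above. It is
untested and only illustrates the shape of the algorithm: it assumes
Python 2 (to match the urllib.urlretrieve mentioned in the original
post) and BeautifulSoup 3; START_URL and OUT_DIR are made-up
placeholders; and a real crawler should add the politeness measures
from the compassionate-screenscraping link (check robots.txt, back
off on errors, and so on).

# Rough sketch only: crawl one site, saving any PDFs it links to.
# Assumes Python 2 and BeautifulSoup 3; START_URL and OUT_DIR are
# placeholders, not anything from the original post.
import os
import time
import urllib
import urllib2
import urlparse

from BeautifulSoup import BeautifulSoup  # BS3; bs4 spells this "from bs4 import BeautifulSoup"

START_URL = "http://www.example.com/"    # placeholder
OUT_DIR = "pdfs"                         # placeholder

def crawl(start_url, out_dir):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    seen = set()
    to_visit = [start_url]
    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except (urllib2.URLError, IOError):
            continue                     # skip pages that won't download
        soup = BeautifulSoup(html)
        for tag in soup.findAll('a', href=True):
            link = urlparse.urljoin(url, tag['href'])
            if link.lower().endswith('.pdf'):
                name = os.path.join(out_dir, os.path.basename(link))
                urllib.urlretrieve(link, name)   # what the OP planned to use
            elif link.startswith(start_url):     # stay on the one site
                to_visit.append(link)
        time.sleep(1)                    # be polite; see the link above

if __name__ == '__main__':
    crawl(START_URL, OUT_DIR)

The one-second sleep and the startswith() test are crude stand-ins
for "be polite" and "decide whether to follow each link"; both would
need more thought on a real site.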
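
On the pypdf part of the quoted question, which the reply above does
not really answer: a minimal, untested sketch of pulling raw text out
of one downloaded PDF with pyPdf is below. extractText() only
recovers text the PDF actually stores as text (not numbers locked up
in images), so whether pypdf is "flexible enough" depends entirely on
the documents. The file name is a made-up placeholder.

# Minimal pyPdf sketch: dump each page's extractable text from one PDF.
# "report.pdf" is a placeholder name, not a file from the original post.
from pyPdf import PdfFileReader

reader = PdfFileReader(open("report.pdf", "rb"))
for page_number in range(reader.getNumPages()):
    text = reader.getPage(page_number).extractText()
    print text    # Python 2 print statement, matching the sketch above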