Hi Steven,
From: Steven D'Aprano <st...@pearwood.info>
To: tutor@python.org
Sent: Wed, May 18, 2011 1:13:17 PM
Subject: Re: [Tutor] can I walk or glob a website?

On Wed, 18 May 2011 07:06:07 pm Albert-Jan Roskam wrote:
> Hello,
>
> How can I walk (as in os.walk) or glob a website?

If you're on Linux, use wget or curl.

===> Thanks for your reply. I tried wget, which seems to be a very handy
tool. However, it doesn't work on this particular site. I tried

wget -e robots=off -r -nc --no-parent -l6 -A.pdf 'http://www.landelijkregisterkinderopvang.nl/'

(the quotes are there because I originally used a deeper link that
contains ampersands). I also tested it on python.org, where it does work.
Adding -e robots=off didn't work either. Do you think this could be a
protection from the administrator?

If you're on Mac, you can probably install them using MacPorts. If you're
on Windows, you have my sympathies. *wink*

> I want to download
> all the pdfs from a website (using urllib.urlretrieve),

This first part is essentially duplicating wget or curl. The basic
algorithm is:

- download a web page
- analyze that page for links (such as <a href=...> but possibly also
  others)
- decide whether you should follow each link and download that page
- repeat until there's nothing left to download, the website blocks your
  IP address, or you've got everything you want

except wget and curl already do 90% of the work.

If the webpage requires Javascript to make things work, wget or curl
can't help. I believe there is a Python library called Mechanize to help
with that. For dealing with real-world HTML (also known as "broken" or
"completely f***ed" HTML, please excuse the self-censorship), the library
BeautifulSoup may be useful.

Before doing any mass downloading, please read this:

http://lethain.com/an-introduction-to-compassionate-screenscraping/

> extract
> certain figures (using pypdf- is this flexible enough?) and make some
> statistics/graphs from those figures (using rpy and R). I forgot what
> the process of 'automatically downloading' is called again, something
> that sounds like 'whacking' (??)

Sometimes called screen or web scraping, recursive downloading, or
copyright-infringement *wink*

http://en.wikipedia.org/wiki/Web_scraping

--
Steven D'Aprano
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
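
For reference, here is a minimal sketch of the basic algorithm quoted above (download a page, analyze it for links, decide which to follow, repeat), using only the Python 3 standard library. It is an illustration under assumptions, not a replacement for wget: the names LinkParser, fetch and crawl_pdfs are made up for this example, it only looks at <a href=...> links, it stays on the starting host, and it only collects the PDF URLs rather than downloading them. The delay between requests is an assumed default, in the spirit of the "compassionate screenscraping" article linked above.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page and return it as text (illustrative helper)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_pdfs(start_url, fetch=fetch, max_pages=50, delay=1.0):
    """Walk one site breadth-first and return the PDF URLs it links to."""
    host = urlparse(start_url).netloc
    queue, seen, pdfs = [start_url], {start_url}, []
    pages = 0
    while queue and pages < max_pages:
        url = queue.pop(0)
        pages += 1
        time.sleep(delay)                 # be polite: pause between requests
        parser = LinkParser()
        parser.feed(fetch(url))           # download the page, analyze it for links
        for href in parser.links:
            # Resolve relative links against the current page, drop #fragments.
            link = urljoin(url, href).split("#")[0]
            if link in seen or urlparse(link).netloc != host:
                continue                  # already handled, or off-site
            seen.add(link)
            if urlparse(link).path.lower().endswith(".pdf"):
                pdfs.append(link)         # a file we want: remember it
            else:
                queue.append(link)        # another page: follow it later
    return pdfs
```

Each URL returned by crawl_pdfs(...) could then be saved with urllib.request.urlretrieve(url, filename), the Python 3 location of the urllib.urlretrieve function mentioned in the original question. Passing a different fetch function makes it possible to try the loop on canned HTML without touching the network.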