Re: [Tutor] can I walk or glob a website?

Dave Angel Wed, 18 May 2011 02:56:56 -0700

On 01/-10/-28163 02:59 PM, Alan Gauld wrote:


"Albert-Jan Roskam" <fo...@yahoo.com> wrote

How can I walk (as in os.walk) or glob a website?


I don't think there is a way to do that via the web.
Of course if you have access to the web server's filesystem you can use
os.walk to do it as for any other filesystem, but I don't think it's
generally possible over http. (And indeed it shouldn't be for very good
security reasons!)

OTOH I've been wrong before! :-)

It has to be (more or less) possible. That's what Google does for their search engine.


Three broad issues.

1) Are you violating the terms of service of such a web site? Are you going to be doing this seldom enough that the bandwidth used won't be a DoS attack? Are there copyrights to the material you plan to download? Is the website protected by a login, by cookies, or a VPN? Does the website present a different view to different browsers, different OS's, or different target domains?

2) Websites vary enormously in their adherence to standards. There are many such standards, and browsers tend to be very tolerant of bugs in the site which will be painful for you to accommodate. And some of the extensions/features are very hard to parse, such as Flash. Others, such as JavaScript, can make it hard to do it statically.

3) How important is it to do it reliably? Your code may work perfectly with a particular website, and next week they'll make a change which breaks your code entirely. Are you willing to rework the code each time that happens?

Many sites have APIs that you can use to access them. Sometimes this is a better answer.

With all of that said, I'll point you to Beautiful Soup, as a library that'll parse a page of moderately correct HTML and give you the elements of it. If it's a static page, you can then walk the elements of the tree that Beautiful Soup gives you, and find all the content that interests you. You can also find all the web pages that the first one refers to, and recurse on that.
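To make the "find all the web pages that the first one refers to" step concrete, here is a minimal sketch. It is not Beautiful Soup: it uses the standard library's html.parser as a stand-in so it runs without installing anything, and it will be less forgiving of broken markup than Beautiful Soup is. The names LinkCollector and extract_links, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/docs/" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs a page links to (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A hand-written page standing in for something downloaded from a site.
sample = (
    '<html><body>'
    '<a href="/docs/">Docs</a> '
    '<a href="http://other.example/x">Elsewhere</a>'
    '</body></html>'
)
print(extract_links(sample, "http://example.com/index.html"))
```

Each URL this returns is a candidate page to fetch and parse in turn, which is the recursion described above.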

Notice that you need to limit your scope, since many websites have direct and indirect links to most of the web. For example, you might only recurse into links that refer to the same domain. For many websites, that means you won't get it all. So you may want to supply a list of domains and/or subdomains that you're willing to recurse into.
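A rough sketch of that scoping idea, assuming a breadth-first walk with a set of allowed host names and a cap on the number of pages. Everything named here (walk_site, fetch_page, allowed_domains, max_pages) is hypothetical, and the link extraction again uses the standard library's html.parser rather than Beautiful Soup. The page fetcher is passed in as a parameter so the walk can be tried on in-memory pages without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _Links(HTMLParser):
    """Gather the raw href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch_page(url):
    """Download a page as text. Needs network access; assumes UTF-8."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def walk_site(start_url, allowed_domains, fetch=fetch_page, max_pages=50):
    """Breadth-first walk that only follows links into allowed_domains."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page: skip it and keep walking
        visited.append(url)
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).hostname in allowed_domains and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Try it on a fake three-page site instead of the real web.
pages = {
    "http://example.com/": '<a href="/a">a</a><a href="http://elsewhere.example/">out</a>',
    "http://example.com/a": '<a href="/">home</a><a href="/b#top">b</a>',
    "http://example.com/b": "",
}
print(walk_site("http://example.com/", {"example.com"}, fetch=lambda u: pages[u]))
```

The seen set is what keeps pages that link back to each other from looping forever, and max_pages is a crude guard on bandwidth. A real crawler would also want a delay between requests.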


See   http://pypi.python.org/pypi/BeautifulSoup/3.2.0

DaveA

