David Crossley wrote:
> Upayavira wrote:
>> Sylvain Wallez wrote:
>>> Carsten Ziegeler wrote:
>>>> Sylvain Wallez wrote:
>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the website.
>>>>> So although the new crawler can be based on servlets, it will assume
>>>>> these servlets answer to a ?cocoon-view=links :-)
>>>>>
>>>> Hmm, I think we don't need the links view in this case anymore. A
>>>> simple HTML crawler should be enough, as it will follow all links on
>>>> the page. The view would only make sense in cases where you don't
>>>> output HTML and the usual crawler tools would not work.
>>>>
>>> In the case of Forrest, you're probably right. But the links view also
>>> allows following links in pipelines that produce something other than
>>> HTML, such as PDF, SVG, WML, etc.
>>>
>>> We have to decide if we want to lose this feature.
>
> I am not sure if we use this in Forrest. If not,
> then we probably should be.
>
>> In my view, the whole idea of crawling (i.e. gathering links from
>> pages) is suboptimal anyway. For example, some sites don't directly
>> link to all pages (e.g. they are accessed via JavaScript, or whatever),
>> so some pages get missed.
>>
>> Were I to code a new CLI, whilst I would support crawling, I would
>> mainly configure the CLI to get the list of pages to visit by calling
>> one or more URLs. Those URLs would specify the pages to generate.
>>
>> Thus, Forrest would transform its site.xml file into this list of
>> pages, and drive the CLI via that.
>
> This is what we already do. We have a property
> "start-uri=linkmap.html"
> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
> (we actually use the corresponding XML, of course).
>
> We define a few extra URIs in the Cocoon cli.xconf.
>
> There are issues, of course. Sometimes we want to
> include directories of files that are not referenced
> in the site.xml navigation. For my sites I just use a
> DirectoryGenerator to build an index page which feeds
> the crawler. Sometimes that technique is not sufficient.
>
> We also gather links from text files (e.g. CSS)
> using Chaperon. This works nicely but introduces
> some overhead.
This more or less confirms my suggested approach: allow crawling at the
'end-point' HTML but, more importantly, use a page/URL to identify the
pages to be crawled. The interesting thing from what you say is that
this page could itself be nothing more than HTML.

Regards,
Upayavira
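
A rough sketch, purely to illustrate the idea above and not Cocoon's
actual CLI code: a standalone class that fetches one "list of pages"
URI (something like Forrest's linkmap.html), pulls out its href
targets with a crude regex, and then requests each page in turn. The
class name and the localhost URI are made up for the example, and a
real implementation would use a proper HTML parser and write the
fetched content to the destination directory.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.LinkedHashSet;
  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  /**
   * Illustrative sketch only: fetch a single "page list" URI, extract
   * its href targets, and fetch each target so it could be written out.
   */
  public class PageListCrawler {

      private static final Pattern HREF =
              Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

      public static void main(String[] args) throws Exception {
          // The URI that lists every page to generate (hypothetical default).
          String listUri = args.length > 0
                  ? args[0] : "http://localhost:8888/linkmap.html";

          Set<String> pages = extractLinks(fetch(listUri), listUri);
          for (String page : pages) {
              System.out.println("Generating " + page);
              String content = fetch(page);
              // ... write 'content' to the destination directory here ...
          }
      }

      /** Download a URI as text. */
      static String fetch(String uri) throws Exception {
          StringBuilder sb = new StringBuilder();
          try (BufferedReader in = new BufferedReader(
                  new InputStreamReader(new URL(uri).openStream(), "UTF-8"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  sb.append(line).append('\n');
              }
          }
          return sb.toString();
      }

      /** Pull href targets out of an HTML page, resolved against its base URI. */
      static Set<String> extractLinks(String html, String baseUri) throws Exception {
          Set<String> links = new LinkedHashSet<>();
          URL base = new URL(baseUri);
          Matcher m = HREF.matcher(html);
          while (m.find()) {
              String target = m.group(1);
              if (target.startsWith("#") || target.startsWith("mailto:")) {
                  continue;                                 // skip fragments and mail links
              }
              links.add(new URL(base, target).toString());  // resolve relative URIs
          }
          return links;
      }
  }

The point of the sketch is that nothing in it cares whether the list
page is generated from site.xml, from a DirectoryGenerator index, or by
hand; any URI that yields HTML with the right links will drive it.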
