David Crossley wrote:
> Upayavira wrote:
>> Sylvain Wallez wrote:
>>> Carsten Ziegeler wrote:
>>>> Sylvain Wallez wrote:
>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the website.
>>>>> So although the new crawler can be based on servlets, it will assume
>>>>> these servlets answer to a ?cocoon-view=links :-)
>>>>>
>>>> Hmm, I think we don't need the links view in this case anymore. A
>>>> simple HTML crawler should be enough, as it will follow all links on
>>>> the page. The view would only make sense in cases where you don't
>>>> output HTML and the usual crawler tools would not work.
>>>>
>>> In the case of Forrest, you're probably right. But the links view also
>>> allows following links in pipelines that produce something other than
>>> HTML, such as PDF, SVG, WML, etc.
>>>
>>> We have to decide if we want to lose this feature.
>
> I am not sure if we use this in Forrest. If not,
> then we probably should be.
>
>> In my view, the whole idea of crawling (i.e. gathering links from
>> pages) is suboptimal anyway. For example, some sites don't directly
>> link to all pages (e.g. they are accessed via JavaScript, or whatever),
>> so some pages get missed.
>>
>> Were I to code a new CLI, whilst I would support crawling, I would
>> mainly configure the CLI to get the list of pages to visit by calling
>> one or more URLs. Those URLs would specify the pages to generate.
>>
>> Thus, Forrest would transform its site.xml file into this list of
>> pages, and drive the CLI via that.
>
> This is what we already do. We have a property
> "start-uri=linkmap.html"
> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
> (we actually use the corresponding XML, of course).
>
> We define a few extra URIs in the Cocoon cli.xconf.
>
> There are issues, of course. Sometimes we want to
> include directories of files that are not referenced
> in the site.xml navigation. For my sites I just use a
> DirectoryGenerator to build an index page which feeds
> the crawler. Sometimes that technique is not sufficient.
>
> We also gather links from text files (e.g. CSS)
> using Chaperon. This works nicely but introduces
> some overhead.
This more or less confirms my suggested approach: allow crawling at the
'end-point' HTML but, more importantly, use a page/URL to identify the
pages to be crawled. The interesting thing from what you say is that
this page could itself be nothing more than HTML.

Regards,
Upayavira
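
A rough sketch, purely to illustrate the idea above and not Cocoon's
actual CLI code: a standalone class that fetches one "list of pages"
URI (something like Forrest's linkmap.html), pulls out its href
targets with a crude regex, and then requests each page in turn. The
class name and the localhost URI are made up for the example, and a
real implementation would use a proper HTML parser and write the
fetched content to the destination directory.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.LinkedHashSet;
  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  /**
   * Illustrative sketch only: fetch a single "page list" URI, extract
   * its href targets, and fetch each target so it could be written out.
   */
  public class PageListCrawler {

      private static final Pattern HREF =
              Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

      public static void main(String[] args) throws Exception {
          // The URI that lists every page to generate (hypothetical default).
          String listUri = args.length > 0
                  ? args[0] : "http://localhost:8888/linkmap.html";

          Set<String> pages = extractLinks(fetch(listUri), listUri);
          for (String page : pages) {
              System.out.println("Generating " + page);
              String content = fetch(page);
              // ... write 'content' to the destination directory here ...
          }
      }

      /** Download a URI as text. */
      static String fetch(String uri) throws Exception {
          StringBuilder sb = new StringBuilder();
          try (BufferedReader in = new BufferedReader(
                  new InputStreamReader(new URL(uri).openStream(), "UTF-8"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  sb.append(line).append('\n');
              }
          }
          return sb.toString();
      }

      /** Pull href targets out of an HTML page, resolved against its base URI. */
      static Set<String> extractLinks(String html, String baseUri) throws Exception {
          Set<String> links = new LinkedHashSet<>();
          URL base = new URL(baseUri);
          Matcher m = HREF.matcher(html);
          while (m.find()) {
              String target = m.group(1);
              if (target.startsWith("#") || target.startsWith("mailto:")) {
                  continue;                                 // skip fragments and mail links
              }
              links.add(new URL(base, target).toString());  // resolve relative URIs
          }
          return links;
      }
  }

The point of the sketch is that nothing in it cares whether the list
page is generated from site.xml, from a DirectoryGenerator index, or by
hand; any URI that yields HTML with the right links will drive it.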
