Thorsten Scherler wrote:
> On Mon, 03-04-2006 at 12:34 +0100, Upayavira wrote:
>> Thorsten Scherler wrote:
>>> On Mon, 03-04-2006 at 09:00 +0100, Upayavira wrote:
>>>> David Crossley wrote:
>>>>> Upayavira wrote:
>>>>>> Sylvain Wallez wrote:
>>>>>>> Carsten Ziegeler wrote:
>>>>>>>> Sylvain Wallez wrote:
>>>>>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the
>>>>>>>>> website. So although the new crawler can be based on servlets,
>>>>>>>>> it will assume these servlets to answer to a
>>>>>>>>> ?cocoon-view=links :-)
>>>>>>>>>
>>>>>>>> Hmm, I think we don't need the links view in this case anymore.
>>>>>>>> A simple HTML crawler should be enough as it will follow all
>>>>>>>> links on the page. The view would only make sense in the case
>>>>>>>> where you don't output html, where the usual crawler tools
>>>>>>>> would not work.
>>>>>>>>
>>>>>>> In the case of Forrest, you're probably right. Now the links view
>>>>>>> also allows following links in pipelines producing something
>>>>>>> that's not HTML, such as PDF, SVG, WML, etc.
>>>>>>>
>>>>>>> We have to decide if we want to lose this feature.
>>>>> I am not sure if we use this in Forrest. If not
>>>>> then we probably should be.
>>>>>
>>>>>> In my view, the whole idea of crawling (i.e. gathering links from
>>>>>> pages) is suboptimal anyway. For example, some sites don't directly
>>>>>> link to all pages (e.g. they are accessed via javascript, or
>>>>>> whatever) so you get pages missed.
>>>>>>
>>>>>> Were I to code a new CLI, whilst I would support crawling I would
>>>>>> mainly configure the CLI to get the list of pages to visit by
>>>>>> calling one or more URLs. Those URLs would specify the pages to
>>>>>> generate.
>>>>>>
>>>>>> Thus, Forrest would transform its site.xml file into this list of
>>>>>> pages, and drive the CLI via that.
>>>>> This is what we do do. We have a property
>>>>> "start-uri=linkmap.html"
>>>>> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
>>>>> (we actually use corresponding xml of course).
>>>>>
>>>>> We define a few extra URIs in the Cocoon cli.xconf
>>>>>
>>>>> There are issues of course. Sometimes we want to
>>>>> include directories of files that are not referenced
>>>>> in site.xml navigation. For my sites I just use a
>>>>> DirectoryGenerator to build an index page which feeds
>>>>> the crawler. Sometimes that technique is not sufficient.
>>>>>
>>>>> We also gather links from text files (e.g. CSS)
>>>>> using Chaperon. This works nicely but introduces
>>>>> some overhead.
>>>> This more or less confirms my suggested approach - allow crawling at
>>>> the 'end-point' HTML, but more importantly, use a page/URL to
>>>> identify the pages to be crawled. The interesting thing from what
>>>> you say is that this page could itself be nothing more than HTML.
>>> Well, yes and not really, since e.g. Chaperon is text based and not
>>> markup. You need a lex-writer to generate links for the crawler.
>> Yes. You misunderstand me I think.
>
> Yes, sorry, I did misunderstand you.
>
>> Even if you use Chaperon etc to parse
>> markup, there'd be no difficulty expressing the links that you found as
>> an HTML page - one intended to be consumed by the CLI, not to be
>> publicly viewed.
>
> Well, in the case of CSS you want them publicly viewed as well, but I
> got your point. ;)
>
>> In fact, if it were written to disc, forrest would
>> probably delete it afterwards.
>>
>>> Forrest actually is *not* aimed at html-only support and one can think
>>> of the situation that you want your site to be only txt (kind of a
>>> book). Here you need to crawl the lex-rewriter outcome and follow the
>>> links.
>> Hopefully I've shown that I had understood that already :-)
>
> yeah ;)
>
>>> The current limitations of forrest regarding the crawler are IMO not
>>> caused by the crawler design but rather by our (as in forrest) usage
>>> of it.
>> Yep, fair enough. But if the CLI is going to survive the shift that is
>> happening in Cocoon trunk, something big needs to be done by someone. It
>> cannot survive in its current form as the code it uses is changing
>> almost beyond recognition.
>>
>> Heh, perhaps the Cocoon CLI should just be a Maven plugin.
>
> ...or forrest plugin. ;) This would make it possible for cocoon, lenya
> and forrest committers to help.
>
> Kind of http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

Well, in the end, it is he who implements that decides.

Upayavira
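
To make the "use a page/URL to identify the pages to be crawled" idea from the thread concrete, here is a minimal sketch in Python. It is not Cocoon or Forrest code: the names LinkCollector and pages_to_generate are hypothetical, and it assumes the seed page (something like the linkmap.html mentioned above) is plain HTML whose <a href> values list the pages to generate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the seed page's own URL.
                url = urljoin(self.base_url, value)
                # Keep first-seen order and skip duplicates.
                if url not in self.links:
                    self.links.append(url)


def pages_to_generate(seed_html, base_url):
    """Return the list of page URLs named by an HTML seed page."""
    collector = LinkCollector(base_url)
    collector.feed(seed_html)
    collector.close()
    return collector.links
```

A CLI built this way would fetch one or more seed URLs, run each through something like pages_to_generate, and render every URL in the result. Since the seed page only has to list links, it could equally be produced from site.xml, from a directory index, or from links extracted out of text files.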
