Hi, is there any plan to open source it?
Regards,
Lukas

[OT] I tried HuriSearch and entered "Java" into the search field. It returned a lot of references to ColdFusion error pages. Maybe a recrawl would help?

On Wed, Mar 2, 2011 at 1:25 AM, Dominique Bejean <dominique.bej...@eolya.fr> wrote:
> Hi,
>
> I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web
> crawler. It includes:
>
> * a crawler
> * a document processing pipeline
> * a Solr indexer
>
> The crawler has a web administration interface for managing the web sites
> to be crawled. Each web site crawl can be configured with many parameters
> (not all mandatory):
>
> * number of simultaneous items crawled by site
> * recrawl period rules based on item type (HTML, PDF, …)
> * item type inclusion / exclusion rules
> * item path inclusion / exclusion / strategy rules
> * max depth
> * web site authentication
> * language
> * country
> * tags
> * collections
> * ...
>
> The pipeline includes various ready-to-use stages (text extraction,
> language detection, a writer that produces Solr-ready XML for indexing, ...).
>
> Everything is highly configurable and extensible, either by scripting or
> by Java coding.
>
> With scripting, you can help the crawler handle JavaScript links, or help
> the pipeline extract a relevant title and clean up the HTML pages (remove
> menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl Anywhere web site provides good explanations and screenshots.
> Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download it and try it out from here:
> www.crawl-anywhere.com
>
> Regards
>
> Dominique