Re: [ANNOUNCE] Web Crawler

Rosa (Anuncios) Wed, 02 Mar 2011 00:36:36 -0800

Nice job!

It would be good to be able to extract specific data from a given page via XPath, though.
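
To make the request concrete, here is a minimal sketch of that kind of extraction using only the standard javax.xml.xpath API from the JDK. It is not Crawl Anywhere code: the class name XPathExtractSketch and the method extract are made up for illustration, and the page is assumed to be well-formed XHTML (real-world HTML would first need a tolerant parser).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Crawl Anywhere.
public class XPathExtractSketch {

    // Evaluates an XPath expression against well-formed XML/XHTML
    // and returns the string value of the first matching node.
    static String extract(String xhtml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page; no namespace and no DTD, so plain paths work.
        String page = "<html><head><title>Crawl Anywhere</title></head>"
                + "<body><div id=\"price\">42</div></body></html>";
        System.out.println(extract(page, "/html/head/title"));
        System.out.println(extract(page, "//div[@id='price']"));
    }
}
```

A crawler could let the user attach expressions like these to a site, so that each matching value ends up in its own index field.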


Regards,


On 02/03/2011 01:25, Dominique Bejean wrote:

Hi,
I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web crawler. It includes:
   * a crawler
   * a document processing pipeline
   * a solr indexer
The crawler has a web administration interface for managing the web sites to be crawled. Each web site crawl can be configured with many parameters (not all mandatory):
   * number of simultaneous items crawled by site
   * recrawl period rules based on item type (html, PDF, …)
   * item type inclusion / exclusion rules
   * item path inclusion / exclusion / strategy rules
   * max depth
   * web site authentication
   * language
   * country
   * tags
   * collections
   * ...
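
As an illustration of how path inclusion / exclusion rules and a max depth might combine into a single crawl decision, here is a small sketch. Everything in it is an assumption: the class SiteRulesSketch, its fields, and the rule that exclusions win over inclusions are invented for the example and do not describe Crawl Anywhere's actual configuration model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical per-site rules, not Crawl Anywhere's real configuration classes.
public class SiteRulesSketch {
    private final List<Pattern> include = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();
    private final int maxDepth;

    SiteRulesSketch(List<String> includeRegexes, List<String> excludeRegexes, int maxDepth) {
        for (String r : includeRegexes) include.add(Pattern.compile(r));
        for (String r : excludeRegexes) exclude.add(Pattern.compile(r));
        this.maxDepth = maxDepth;
    }

    // Assumed policy: depth limit first, then exclusions, then inclusions.
    // An empty inclusion list means "everything not excluded".
    boolean shouldCrawl(String path, int depth) {
        if (depth > maxDepth) return false;
        for (Pattern p : exclude) {
            if (p.matcher(path).find()) return false;
        }
        if (include.isEmpty()) return true;
        for (Pattern p : include) {
            if (p.matcher(path).find()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SiteRulesSketch rules = new SiteRulesSketch(List.of("^/docs/"), List.of("\\.pdf$"), 3);
        System.out.println(rules.shouldCrawl("/docs/intro.html", 1));  // included, within depth
        System.out.println(rules.shouldCrawl("/docs/manual.pdf", 1));  // excluded by type
        System.out.println(rules.shouldCrawl("/docs/deep/page.html", 4)); // too deep
    }
}
```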
The pipeline includes various ready-to-use stages (text extraction, language detection, a writer that produces Solr-ready XML for indexing, ...).
Everything is highly configurable and extensible, either by scripting or by Java coding.
With scripting, you can help the crawler handle JavaScript links, or help the pipeline extract a relevant title and clean up the HTML pages (remove menus, headers, footers, ...).
With Java coding, you can develop your own pipeline stage.
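
To give an idea of what a custom pipeline stage could look like, here is a minimal sketch of the general pattern. The Stage interface, the TrimTitleStage class, and the use of a Map as the document are all assumptions made for this example; the real Crawl Anywhere stage API is described in the project wiki and may look quite different.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pipeline, not the Crawl Anywhere API.
public class PipelineSketch {

    // A stage takes a document (field name -> value) and returns the processed document.
    interface Stage {
        Map<String, String> process(Map<String, String> doc);
    }

    // Example custom stage: trims the title and collapses runs of whitespace.
    static class TrimTitleStage implements Stage {
        @Override
        public Map<String, String> process(Map<String, String> doc) {
            Map<String, String> out = new HashMap<>(doc); // leave the input untouched
            String title = out.get("title");
            if (title != null) {
                out.put("title", title.trim().replaceAll("\\s+", " "));
            }
            return out;
        }
    }

    // Runs the stages in order, feeding each stage the previous stage's output.
    static Map<String, String> run(List<Stage> stages, Map<String, String> doc) {
        Map<String, String> current = doc;
        for (Stage stage : stages) {
            current = stage.process(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "  Crawl   Anywhere \n");
        Map<String, String> result = run(List.of(new TrimTitleStage()), doc);
        System.out.println("[" + result.get("title") + "]");
    }
}
```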
The Crawl Anywhere web site provides good explanations and screenshots. All is documented in a wiki.
The current version is 1.1.4. You can download it and try it out from here: www.crawl-anywhere.com
Regards

Dominique
