Lukas,

I am thinking about it, but no decision yet.

Anyway, in the next release I will provide the source code of pipeline stages and connectors as samples.

Dominique

On 02/03/11 10:01, Lukáš Vlček wrote:
Hi,

Is there any plan to open-source it?

Regards,
Lukas

[OT] I tried HuriSearch, entered "Java" into the search field, and it returned a lot of references to ColdFusion error pages. Maybe a recrawl would help?

On Wed, Mar 2, 2011 at 1:25 AM, Dominique Bejean <dominique.bej...@eolya.fr> wrote:

    Hi,

    I would like to announce Crawl-Anywhere. Crawl-Anywhere is a Java
    web crawler. It includes:

      * a crawler
      * a document processing pipeline
      * a solr indexer

    The crawler has a web administration interface for managing the web
    sites to be crawled. Each web site crawl is configured with many
    possible parameters (not all mandatory):

      * number of simultaneous items crawled by site
      * recrawl period rules based on item type (html, PDF, …)
      * item type inclusion / exclusion rules
      * item path inclusion / exclusion / strategy rules
      * max depth
      * web site authentication
      * language
      * country
      * tags
      * collections
      * ...

    The pipeline includes various ready-to-use stages (text
    extraction, language detection, a Solr-ready XML writer, ...).

    Everything is very configurable and extensible, either by scripting
    or by Java coding.

    With scripting, you can help the crawler handle JavaScript links,
    or help the pipeline extract the relevant title and clean up the
    HTML pages (remove menus, headers, footers, ...).

    With Java coding, you can develop your own pipeline stages.
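    As an illustration, a custom stage might look like the sketch below. This is purely hypothetical Java: the `Document` and `PipelineStage` types are stand-ins invented here for whatever interfaces Crawl-Anywhere actually exposes, not its real API.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal stand-in for a pipeline document: a URL plus named text fields. */
class Document {
    final String url;
    final Map<String, String> fields = new HashMap<>();
    Document(String url) { this.url = url; }
}

/** A pipeline stage transforms a document and passes it on. */
interface PipelineStage {
    Document process(Document doc);
}

/** Example custom stage: strip HTML tags from the raw content field. */
class HtmlStripStage implements PipelineStage {
    @Override
    public Document process(Document doc) {
        String html = doc.fields.getOrDefault("content", "");
        // Naive tag removal for illustration; a real stage would use an HTML parser.
        String text = html.replaceAll("<[^>]+>", " ")
                          .replaceAll("\\s+", " ")
                          .trim();
        doc.fields.put("text", text);
        return doc;
    }
}

public class StageDemo {
    public static void main(String[] args) {
        Document doc = new Document("http://www.example.com/");
        doc.fields.put("content", "<html><body><h1>Hello</h1> <p>World</p></body></html>");
        Document out = new HtmlStripStage().process(doc);
        System.out.println(out.fields.get("text")); // prints "Hello World"
    }
}
```

    Chaining several such stages (extraction, language detection, Solr XML writing) is the general idea the announcement describes; the real configuration and wiring live in Crawl-Anywhere's own configuration files.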

    The Crawl-Anywhere web site provides good explanations and
    screenshots. Everything is documented in a wiki.

    The current version is 1.1.4. You can download and try it out
    here: www.crawl-anywhere.com


    Regards,

    Dominique

