On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi:
>
> Are there relatively stand-alone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials? I have
> seen some discussion about a Cocoon crawler.. was that successful?
http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is Cocoon based, and it is working very well. However, Cocoon's crawler is not standalone but uses the Cocoon CLI. I am using the solr/forrest plugin for the commit and for dispatching the update. The indexing transformation in the plugin is a wee bit different than the one in my project, since I needed to extract more information from the documents to create better filters.

However, since the Cocoon CLI is no longer in 2.2 (cocoon-trunk) and Forrest uses it as its main component, I am keen to write a simple crawler that could be reused for Cocoon, Forrest, Solr, Nutch, ... I may start something pretty soon (I guess I will open a project in Apache Labs) and will keep this list informed.

My idea is to write a simple crawler which could be easily extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality. A Solr plugin for this crawler would be very simple: basically it would parse the (e.g.) HTML page and dispatch an update command for the extracted fields. I think one should try to reuse as much code from Nutch as possible for this parsing.

If somebody is interested in such a standalone crawler project, I welcome any help, ideas, suggestions, feedback and/or questions.

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
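To make the plugin idea concrete, here is a rough sketch of what I have in mind. All the names here (CrawlerPlugin, SolrPlugin, the title extraction) are hypothetical, just to illustrate the shape; a real Solr plugin would do proper HTML parsing (ideally with the Nutch parsers) and POST the resulting update command to /solr/update.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlerSketch {

    /** Hypothetical plugin hook: the crawler calls this for each fetched URL. */
    interface CrawlerPlugin {
        void process(String url, String content);
    }

    /**
     * Sketch of a Solr plugin: extracts a couple of fields from the page
     * and builds the <add><doc>...</doc></add> update command for them.
     */
    static class SolrPlugin implements CrawlerPlugin {
        String lastUpdate; // the last update command that would be sent to Solr

        public void process(String url, String content) {
            // Naive "parsing" for illustration: grab the <title> as one field.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("id", url);
            int s = content.indexOf("<title>");
            int e = content.indexOf("</title>");
            if (s >= 0 && e > s) {
                fields.put("title", content.substring(s + 7, e));
            }
            lastUpdate = toUpdateXml(fields);
            // A real plugin would now dispatch lastUpdate to /solr/update.
        }

        /** Builds a Solr XML update command from the extracted fields. */
        static String toUpdateXml(Map<String, String> fields) {
            StringBuilder sb = new StringBuilder("<add><doc>");
            for (Map.Entry<String, String> f : fields.entrySet()) {
                sb.append("<field name=\"").append(f.getKey()).append("\">")
                  .append(f.getValue()).append("</field>");
            }
            return sb.append("</doc></add>").toString();
        }
    }

    public static void main(String[] args) {
        SolrPlugin plugin = new SolrPlugin();
        plugin.process("http://example.org/",
                "<html><head><title>Hello</title></head></html>");
        System.out.println(plugin.lastUpdate);
    }
}
```

The point is that the crawler core only knows the plugin interface; everything Solr-specific (field extraction, the update command, the commit) lives in the plugin, so the same core could serve Cocoon, Forrest or Nutch with different plugins.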