Thorsten:

Thank you very much for the update.

On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi:
>
> Are there relatively standalone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials? I have
> seen some discussion about a Cocoon crawler.. was that successful?

http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is Cocoon based, and it is
working very well. However, Cocoon's crawler is not standalone; it uses
the Cocoon CLI. I am using the solr/forrest plugin for the commit and
for dispatching the update. The indexing transformation in the plugin is a
wee bit different from the one in my project, since I needed to extract
more information from the documents to create better filters.

However, since the Cocoon CLI is no longer in 2.2 (cocoon-trunk) and
Forrest uses it as its main component, I am keen to write a simple
crawler that could be reused for Cocoon, Forrest, Solr, Nutch, ...

I may start something pretty soon (I guess I will open a project in
Apache Labs) and will keep this list informed. My idea is to write a
simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled URL, one could write a
plugin to implement the functionality. A Solr plugin for this crawler
would be very simple: basically, it would parse e.g. the HTML page and
dispatch an update command for the extracted fields. I think one
should try to reuse as much code from Nutch as possible for this parsing.
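The plugin idea described above could be sketched roughly as follows in Java. This is only an illustration under assumptions of my own: the names `CrawlerPlugin` and `SolrUpdatePlugin` and the title-extraction regex are hypothetical, and the sketch just builds Solr's XML update command for a fetched page rather than posting it over HTTP (it also omits XML escaping of field values).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plugin contract: the crawler calls process() for every fetched page.
interface CrawlerPlugin {
    String process(String url, String html);
}

// Hypothetical Solr plugin: extract a couple of fields from the HTML and
// turn them into a Solr XML <add> command for the extracted fields.
class SolrUpdatePlugin implements CrawlerPlugin {
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public String process(String url, String html) {
        Matcher m = TITLE.matcher(html);
        // Fall back to the URL when the page has no <title>.
        String title = m.find() ? m.group(1).trim() : url;
        // Solr's XML update message format: <add><doc><field .../></doc></add>
        // (field values should be XML-escaped in real code).
        return "<add><doc>"
             + "<field name=\"id\">" + url + "</field>"
             + "<field name=\"title\">" + title + "</field>"
             + "</doc></add>";
    }
}

public class CrawlerSketch {
    public static void main(String[] args) {
        CrawlerPlugin p = new SolrUpdatePlugin();
        System.out.println(p.process("http://example.org/",
            "<html><head><title>Example</title></head><body>hi</body></html>"));
    }
}
```

The resulting XML string would then be POSTed to Solr's update handler, followed by a `<commit/>`.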

I have seen some discussion regarding the Nutch crawler. I think a standalone
crawler would be more desirable.. as you pointed out, one could extend such a
crawler via plugins. It seems difficult to "rip out" the Nutch crawler as a
standalone crawler, no? Because you would want as much of the "same code base"
as possible, no? I also think such a crawler would be interesting in the
vertical search engine space. So Nutch 0.7 could be a good target, no?

Regards
