On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi:
>
> Are there relatively stand-alone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials? I have
> seen some discussion about a Cocoon crawler.. was that successful?
http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is Cocoon based, and it is working very well. However, Cocoon's crawler is not standalone but uses the Cocoon CLI. I am using the solr/forrest plugin for the commit and for dispatching the update. The indexing transformation in the plugin is a wee bit different than the one in my project, since I needed to extract more information from the documents to create better filters.

However, since the Cocoon CLI is no longer in 2.2 (cocoon-trunk) and Forrest uses it as its main component, I am keen to write a simple crawler that could be reused for Cocoon, Forrest, Solr, Nutch, ... I may start something pretty soon (I guess I will open a project in Apache Labs) and will keep this list informed.

My idea is to write a simple crawler which could be easily extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality. A Solr plugin for this crawler would be very simple: basically it would parse the (e.g.) HTML page and dispatch an update command for the extracted fields. I think one should try to reuse as much code from Nutch as possible for this parsing.

If somebody is interested in such a standalone crawler project, I welcome any help, ideas, suggestions, feedback and/or questions.

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
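To make the plugin idea concrete, here is a rough sketch of what I have in mind. All the names here (CrawlerPlugin, SolrPlugin, the title extraction) are hypothetical, just to illustrate the shape; a real Solr plugin would do proper HTML parsing (ideally with the Nutch parsers) and POST the resulting update command to /solr/update.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlerSketch {

    /** Hypothetical plugin hook: the crawler calls this for each fetched URL. */
    interface CrawlerPlugin {
        void process(String url, String content);
    }

    /**
     * Sketch of a Solr plugin: extracts a couple of fields from the page
     * and builds the <add><doc>...</doc></add> update command for them.
     */
    static class SolrPlugin implements CrawlerPlugin {
        String lastUpdate; // the last update command that would be sent to Solr

        public void process(String url, String content) {
            // Naive "parsing" for illustration: grab the <title> as one field.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("id", url);
            int s = content.indexOf("<title>");
            int e = content.indexOf("</title>");
            if (s >= 0 && e > s) {
                fields.put("title", content.substring(s + 7, e));
            }
            lastUpdate = toUpdateXml(fields);
            // A real plugin would now dispatch lastUpdate to /solr/update.
        }

        /** Builds a Solr XML update command from the extracted fields. */
        static String toUpdateXml(Map<String, String> fields) {
            StringBuilder sb = new StringBuilder("<add><doc>");
            for (Map.Entry<String, String> f : fields.entrySet()) {
                sb.append("<field name=\"").append(f.getKey()).append("\">")
                  .append(f.getValue()).append("</field>");
            }
            return sb.append("</doc></add>").toString();
        }
    }

    public static void main(String[] args) {
        SolrPlugin plugin = new SolrPlugin();
        plugin.process("http://example.org/",
                "<html><head><title>Hello</title></head></html>");
        System.out.println(plugin.lastUpdate);
    }
}
```

The point is that the crawler core only knows the plugin interface; everything Solr-specific (field extraction, the update command, the commit) lives in the plugin, so the same core could serve Cocoon, Forrest or Nutch with different plugins.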