Thorsten: First of all, I read your lab idea with great interest, as I am in need of such a crawler. However, there are certain things I would like to discuss. I am not sure which forum is appropriate for this, so I will do my idea shooting here first; please tell me where I should post further comments.
A vertical search engine that focuses on a specific set of data (using Solr, for example, because it provides maximum field flexibility) would greatly benefit from such a crawler. The next big Technorati or the next big event-finding solution could use your crawler to crawl feeds via a feed plugin (maybe Nutch plugins) or to scrape websites for event info using some XPath/XQuery machinery (personally I think XPath is a pain in the a... :-)

What I worry about are the issues such a crawler has to deal with: updating crawls, how many threads per host, scale, etc. All the maintainer's headaches! I know you will reuse as much code as you can from Nutch and are not planning to re-invent the wheel. But wouldn't it be much easier to jump into Sami's idea and make it better and more stand-alone, while still benefiting from the Nutch community? I wonder, wouldn't it be easier to pursue a route where the Nutch crawler becomes a standalone crawler? No? I read a post about it on the list.

I would like to hear more about how your plan will evolve in terms of Druids, and why not join forces with Sami and co.?

Regards

On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> rubdabadub wrote:
> > Hi:
> >
> > Are there relatively stand-alone crawlers that are
> > suitable/customizable for Solr? Has anyone done any trials? I have
> > seen some discussion about the Cocoon crawler.. was that successful?
>
> There's also an integration path available for Nutch[1] that I plan to
> integrate after 0.9.0 is out.

Sounds very nice, I just finished reading it. Thanks.

Today I submitted a proposal for an Apache Labs project called Apache Druids.
http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The basic idea is to create a flexible crawler framework. The core should be a simple crawler which could easily be extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality.

salu2

> --
> Sami Siren
>
> [1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html

--
Thorsten Scherler
thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
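P.S. To make the plugin idea concrete for myself, here is a minimal sketch of what a plugin-extensible crawler core might look like. This is purely hypothetical: `CrawlPlugin`, `FeedPlugin`, and `CrawlerCore` are names I made up for illustration, not actual Druids or Nutch APIs. The point is just that the core dispatches each fetched URL to whichever plugin claims it, and falls through to a default otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point: a plugin claims URLs and processes their content.
interface CrawlPlugin {
    boolean accepts(String url);                 // should this plugin handle the URL?
    String process(String url, String content);  // transform the fetched content
}

// Example plugin: tags content fetched from RSS feed URLs.
class FeedPlugin implements CrawlPlugin {
    public boolean accepts(String url) { return url.endsWith(".rss"); }
    public String process(String url, String content) { return "feed:" + content; }
}

// Minimal core: iterates registered plugins, first match wins.
class CrawlerCore {
    private final List<CrawlPlugin> plugins = new ArrayList<CrawlPlugin>();
    void register(CrawlPlugin p) { plugins.add(p); }
    String handle(String url, String content) {
        for (CrawlPlugin p : plugins) {
            if (p.accepts(url)) return p.process(url, content);
        }
        return content; // no plugin claimed it: pass through unchanged
    }
}

public class Demo {
    public static void main(String[] args) {
        CrawlerCore core = new CrawlerCore();
        core.register(new FeedPlugin());
        System.out.println(core.handle("http://example.com/news.rss", "<rss/>"));
        // prints "feed:<rss/>"
    }
}
```

A scraping plugin for event pages would slot in the same way, with `accepts()` matching the target site and `process()` running the XPath extraction.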