My quick feedback would be: Try using Nutch first, because it is a more complete "platform". From what I know, Droids is just the crawler with an in-memory queue + link extractor. We did use it for crawling Lucene project sites (for the index on http://search-lucene.com/ ), but that is because the data volume is low, the crawl very narrow, scaling requirements low, etc.
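For readers unfamiliar with the phrase "crawler with an in-memory queue + link extractor", here is a minimal Python sketch of that shape. It is not Droids code and does not use the Droids API. The names `LinkExtractor`, `crawl`, `fetch`, and `max_pages` are made up for illustration, and `fetch` is assumed to be any callable that maps a URL to an HTML string (or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl that stays on the seed's host.

    `fetch` is a hypothetical callable: url -> HTML string, or None on failure.
    Returns the URLs visited, in visit order.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])   # the whole frontier lives in memory
    seen = {seed}           # URLs already queued, to avoid revisits
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop fragments.
            absolute = urljoin(url, href).split("#", 1)[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Because the queue and the seen-set are held in one process, a design like this suits a narrow, low-volume crawl and stops being practical once the frontier outgrows one machine's memory, which is the gap Nutch fills with Hadoop.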
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: MitchK <mitc...@web.de>
> To: solr-user@lucene.apache.org
> Sent: Wed, June 16, 2010 11:27:20 AM
> Subject: Solr and Nutch/Droids - to use or not to use?
>
> Hello community,
>
> from several discussions about Solr and Nutch, I have some questions
> about a virtual web search engine.
>
> I know I posted this message to the mailing list a few days ago, but the
> thread got hijacked and I did not get any more replies on the topic, so I
> am trying to reopen it. Hopefully no one gets upset :-). Please bear with
> me. Thank you.
>
> The requirements:
>
> I. I need a scalable solution for a growing index that becomes larger
> than one machine can handle. If I add more hardware, I want performance
> to improve linearly.
>
> II. I want to use technologies like the OPIC algorithm (the default
> algorithm in Nutch) or PageRank or whatever else is out there to improve
> the ranking of the web pages.
>
> III. I want to be able to easily add more fields to my documents.
> Imagine one retrieves information from a web page's content; then I want
> to make it searchable.
>
> IV. While fetching my data, I want to make special searches possible.
> For example, I want to retrieve pictures from a web page, index
> picture-related content into another search index, and save a small
> thumbnail of the picture itself. Btw: This is (as far as I know) not
> possible with Solr, because Solr was not intended to do such special
> indexing logic.
>
> V. I want to use filter queries. For example, the main query
> "christopher lee" returns 1.5 million results and the user then enters
> the subquery "action": the main query would become a filter query and
> "action" would be the actual query. That way a search within search
> results would be easy to offer.
>
> VI. I want to be able to use different logic for different pages. Maybe
> I have a pool of 100 domains that I know better than others, and I have
> special scripts that retrieve more specific information from those 100
> domains. Then I want to apply my special logic to those 100 domains, but
> every other domain should use the default logic.
>
> -----------------
>
> The project is only virtual. So why am I asking? I want to learn more
> about web search and I would like to gain some new experience.
>
> What do I know about Solr + Nutch:
> As it is said on lucidimagination.com, Solr + Nutch does not scale if
> the index is too large. The article was somewhat old, and I don't know
> whether this problem has been fixed by the new distributed abilities of
> Solr.
>
> Furthermore, I don't want to index the pages with Nutch and reindex them
> with Solr. The only exception would be: if the content of a web page
> gets indexed by Nutch, I want to use the already tokenized content of
> the body with some Solr copyField operations to extend the search (i.e.
> making fuzzy search possible). At the moment I don't think this is
> possible.
>
> I don't know much about the Droids project or how well it is documented.
> But from what I can read in some posts by Otis, it seems to be usable as
> a crawler framework.
>
> Pros for Nutch: It is very scalable! Thanks to Hadoop and MapReduce it
> is a scaling monster (from what I've read).
>
> Cons: The search is not as rich as what is possible with Solr. Extending
> Nutch's search abilities *seems* to be more complicated than with Solr.
> Furthermore, if I want to use Solr to search Nutch's index, then looking
> at my requirements I would need to reindex the whole thing - without the
> benefits of Hadoop.
>
> What I don't know at the moment is how to use algorithms like those
> mentioned in II with Solr.
>
> I hope you understand the problem here - it *seems* to me that Solr
> would not be the best solution for a web search engine, because of
> scaling limits in indexing.
>
> Where should I dive deeper?
> Solr + Droids?
> Solr + Nutch?
> Nutch + howToExtendNutchToMakeSearchBetter?
>
> Thanks for the discussion!
> - Mitch
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900069.html
> Sent from the Solr - User mailing list archive at Nabble.com.
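
Requirement V in the quoted message (searching within earlier results) maps onto Solr's standard `q` and `fq` request parameters. Below is a small Python sketch of how such a request URL could be built. The function name `search_within_results` and the base URL `http://localhost:8983/solr/select` are assumptions for illustration only; the actual host, port, and handler path depend on the installation.

```python
from urllib.parse import urlencode


def search_within_results(base_url, previous_query, refinement):
    """Build a Solr select URL that searches `refinement` inside the
    result set of `previous_query`.

    The earlier query goes into `fq` (it restricts the result set without
    affecting scoring); the new terms go into `q`, so they drive ranking.
    """
    params = [
        ("q", refinement),
        ("fq", previous_query),
        ("wt", "json"),  # ask Solr for a JSON response
    ]
    return base_url + "?" + urlencode(params)


# Assumed local Solr endpoint; adjust to the real deployment.
url = search_within_results(
    "http://localhost:8983/solr/select",
    '"christopher lee"',
    "action",
)
print(url)
```

Solr caches filter queries separately from the main query, so reusing the earlier query as `fq` is usually cheap when the user refines the same result set several times.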