Re: Solr and Nutch/Droids - to use or not to use?

Otis Gospodnetic Wed, 16 Jun 2010 08:37:38 -0700

My quick feedback would be:
Try using Nutch first, because it is a more complete "platform".  From what I 
know, Droids is just the crawler with an in-memory queue + link extractor.  We 
did use it for crawling Lucene project sites (for the index on 
http://search-lucene.com/ ), but that is because the data volume is low, the 
crawl very narrow, scaling requirements low, etc.
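
For illustration only, the pattern described above (a crawler with an in-memory queue plus a link extractor) can be sketched in a few lines of Python. This is a generic sketch, not Droids' actual API; the names LinkExtractor and crawl, and the injected fetch callable, are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque([seed])  # in-memory frontier: gone if the process dies
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            # Narrow crawl: only follow links that stay on the allowed host.
            if urlparse(absolute).netloc == allowed_host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Because the frontier lives only in memory, it is bounded by one machine's RAM and lost on restart. That is acceptable for a low-volume, narrow crawl, but not for a crawl that has to keep growing.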


 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: MitchK <[email protected]>
> To: [email protected]
> Sent: Wed, June 16, 2010 11:27:20 AM
> Subject: Solr and Nutch/Droids - to use or not to use?
> 
> 
> Hello community,
>
> from several discussions about Solr and Nutch, I have some questions about a
> virtual web search engine.
> I know I posted this message to the mailing list a few days ago, but the
> thread got hijacked and I did not get any more replies on the topic, so I am
> trying to reopen it. Hopefully no one gets upset here :-).
> Please bear with me. Thank you.

> The requirements:
> I. I need a scalable solution for a growing index that becomes larger than
> one machine can handle. If I add more hardware, I want performance to
> improve linearly.
>
> II. I want to use techniques like the OPIC algorithm (the default algorithm
> in Nutch) or PageRank or... whatever is out there to improve the ranking of
> the web pages.
>
> III. I want to be able to easily add more fields to my documents. Imagine
> one retrieves information from a web page's content; then I want to make it
> searchable.
>
> IV. While fetching my data, I want to make special searches possible. For
> example, I want to retrieve pictures from a web page and index
> picture-related content into another search index, plus I want to save a
> small thumbnail of the picture itself. Btw: this is (as far as I know) not
> possible with Solr, because Solr was not intended to do such special
> indexing logic.
>
> V. I want to use filter queries (e.g. the main query "christopher lee"
> returns 1.5 million results, then the user enters the subquery "action":
> the main query would become a filter query and "action" would be the actual
> query. That way a search within search results would be easy to offer).
>
> VI. I want to be able to use different logic for different pages. Maybe I
> have a pool of 100 domains that I know better than others, and I have
> special scripts that retrieve more specific information from those 100
> domains. Then I want to apply my special logic to those 100 domains, but
> every other domain should use the default logic.
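
One way to picture requirement VI is a lookup table from host name to extraction function, with a fallback for every host that is not listed. The sketch below is only an illustration under that assumption; the function names, the registry, and the host movies.example.com are all hypothetical and do not correspond to any Nutch, Droids, or Solr API.

```python
from urllib.parse import urlparse


def default_extract(url, html):
    """Fallback logic used for every domain without a special handler."""
    return {"url": url, "body": html}


def extract_known_site(url, html):
    """Hypothetical richer extractor for one of the well-known domains."""
    fields = default_extract(url, html)
    fields["source_quality"] = "curated"  # illustrative extra field
    return fields


# Hypothetical registry: host name -> extractor function.
SPECIAL_EXTRACTORS = {
    "movies.example.com": extract_known_site,
}


def extract(url, html):
    """Pick the special extractor for known hosts, the default otherwise."""
    host = urlparse(url).netloc.lower()
    handler = SPECIAL_EXTRACTORS.get(host, default_extract)
    return handler(url, html)
```

Adding a domain to the pool then means adding one entry to the registry; the crawl loop itself does not change.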

----------------- 
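
The search-within-results idea from requirement V maps onto Solr's q and fq request parameters: the earlier query moves into fq and the new term becomes q. A minimal sketch of building such a request URL follows; the base URL and the helper name search_within_results are assumptions for the example, and only the q, fq, and rows parameters are Solr's own.

```python
from urllib.parse import urlencode


def search_within_results(base_url, previous_query, new_query, rows=10):
    """Build a Solr select URL that searches inside an earlier result set."""
    params = [
        ("q", new_query),        # the actual (scored) query
        ("fq", previous_query),  # the earlier query, applied as a filter
        ("rows", rows),
    ]
    return base_url + "/select?" + urlencode(params)


# Hypothetical local Solr instance.
url = search_within_results(
    "http://localhost:8983/solr", '"christopher lee"', "action"
)
```

Note that a filter query restricts the result set without contributing to the score, so ranking here depends on "action" alone.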

> The project is only virtual. So why am I asking?
> I want to learn more about web search and I would like to gain some new
> experience.
>
> What do I know about Solr + Nutch:
> As it is said on lucidimagination.com, Solr + Nutch does not scale if the
> index is too large.
> The article was a little bit older and I don't know whether this problem
> has been fixed by the new distributed abilities of Solr.
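
As a rough sketch of what those distributed abilities look like from the client side: a distributed Solr query lists the shards to search in the shards request parameter, while deciding which shard receives a document at index time is left to the indexing client in this sketch. The host names, the hash-based routing, and both helper names below are assumptions for illustration, not something the thread or Solr prescribes.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical shard addresses in the host:port/path form used by `shards`.
SHARDS = ["host1:8983/solr", "host2:8983/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick a shard for a document by hashing its id (stable across runs)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distributed_query_url(query, shards=SHARDS):
    """Build a select URL that asks one node to fan the query out to all shards."""
    params = {"q": query, "shards": ",".join(shards)}
    return "http://" + shards[0] + "/select?" + urlencode(params)
```

With simple modulo routing like this, adding a shard changes where most documents belong, so growing the cluster would mean redistributing or reindexing documents.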

> Furthermore, I don't want to index the pages with Nutch and reindex them
> with Solr.
> The only exception would be: if the content of a web page gets indexed by
> Nutch, I want to use the already tokenized content of the body with some
> Solr copyField operations to extend the search (e.g. making fuzzy search
> possible). At the moment, I don't think this is possible.
>
> I don't know much about the Droids project and how well it is documented.
> But from what I can read in some posts by Otis, it seems to be usable as a
> crawler framework.


> Pros for Nutch are: it is very scalable! Thanks to Hadoop and MapReduce it
> is a scaling monster (from what I've read).
>
> Cons: the search is not as rich as is possible with Solr. Extending Nutch's
> search abilities *seems* to be more complicated than with Solr. Furthermore,
> if I want to use Solr to search Nutch's index, looking at my requirements I
> would need to reindex the whole thing, without the benefits of Hadoop.
>
> What I don't know at the moment is how it is possible to use algorithms
> like those mentioned in II. with Solr.
>
> I hope you understand the problem here: Solr *seems* to me not to be the
> best solution for a web search engine, because of scaling issues in
> indexing.


> Where should I dive deeper?
> Solr + Droids?
> Solr + Nutch?
> Nutch + howToExtendNutchToMakeSearchBetter?
>
>
> Thanks for the discussion!
> - Mitch
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp900069p900069.html
> Sent from the Solr - User mailing list archive at Nabble.com.
