Re: solr, snippets and stored field in nutch...

Mike Klaas Thu, 11 Oct 2007 16:19:52 -0700

First, it should be noted that I am not an expert in Nutch's architecture. I do think I understand what is being said there, however.

Nutch is a distributed web search engine, and uses Lucene as an indexing component. It is free to use external data structures to store data, and can store the index on a different machine than the one where the contents are stored. They can be updated independently.

One reason why this is more efficient is that in a distributed architecture, more documents are retrieved over the system than are eventually summarized and output. It makes no sense to shovel around the contents of all these documents if summaries are only being returned for the top 10 over the whole system.
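
As a rough illustration of that point, here is a back-of-the-envelope calculation in Python. Every number in it (shard count, hits per shard, average document size) is a made-up assumption chosen only to show the shape of the argument, not a measurement from Nutch or Solr.

```python
# Hypothetical numbers, chosen only to illustrate the argument.
SHARDS = 20             # machines each holding a slice of the index
HITS_PER_SHARD = 10     # candidate hits each shard returns for one query
FINAL_HITS = 10         # results actually summarized and shown
AVG_DOC_BYTES = 20_000  # assumed average size of a document's stored text


def bytes_moved(docs_with_contents: int, avg_doc_bytes: int = AVG_DOC_BYTES) -> int:
    """Bytes of document content shipped over the network."""
    return docs_with_contents * avg_doc_bytes


candidates = SHARDS * HITS_PER_SHARD  # hits retrieved across the whole system
eager = bytes_moved(candidates)       # contents travel with every candidate hit
lazy = bytes_moved(FINAL_HITS)        # contents fetched for the final hits only

print(f"eager: {eager:,} bytes, lazy: {lazy:,} bytes, ratio: {eager // lazy}x")
```

Under these assumed numbers the eager approach moves 20 times as much content as fetching contents for the final 10 hits only.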

But Nutch is still storing the contents _somewhere_. They haven't found a magical technique that makes this need disappear.

So, does an external store make sense for Solr? Well, unlike Nutch, Solr is a solitary unit. If you ask for 10 docs returned, with summaries, all of their contents are going to have to be retrieved. There aren't any advantages to storing the contents in a separate data structure (which will be the same size).

Now, if you are using Solr in a large-scale distributed federated way, then you can replicate Nutch's strategy by storing the index in one Solr index, and the contents in another. This could also yield benefits in a single-machine context if your code accesses many more documents than it wants summarized.

Keep in mind also that Solr has facilities to help you manage the size of the content store. Are you stripping your contents to their bare minima (removing HTML, etc)? Are you using a compressed text field (highly recommended for this kind of data)?
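
To give a feel for how much those two steps can shrink stored content, here is a minimal Python sketch. It is not how Solr implements compressed fields: `zlib` and `html.parser` are stand-ins from the Python standard library, and the names `TextExtractor`, `strip_html`, `compress`, `decompress`, and the sample `page` are all hypothetical.

```python
import zlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def compress(text: str) -> bytes:
    # zlib is an assumption here, standing in for a compressed stored field.
    return zlib.compress(text.encode("utf-8"), 9)


def decompress(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")


# Hypothetical sample page: repetitive on purpose, so it compresses well.
page = (
    "<html><head><style>p { color: red; }</style></head><body>"
    + "<p>Nutch stores page contents in segments, not in the Lucene index.</p>" * 50
    + "</body></html>"
)

plain = strip_html(page)
stored = compress(plain)
print(len(page), len(plain), len(stored))  # raw HTML, stripped text, compressed bytes
```

Real pages will compress less dramatically than this repetitive sample, so treat the printed sizes as an illustration of the direction of the effect only.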

Believe me, if I found that there was a way of providing summaries without storing doc contents, I would pee my pants with happiness and it would be in Solr faster than you can say "diaper".


cheers,
-Mike

On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:

Hey guys,

Check out this thread I opened on the nutch mailing list.  Looks like Solr
can benefit from reusing Nutch's "segment" based storage strategy for
efficiency in returning snippets, summaries etc without using Lucene
stored fields?

Was this considered before?

Ravish

---------- Forwarded message ----------
From: Dennis Kubes <[EMAIL PROTECTED]>
Date: Oct 11, 2007 11:27 PM
Subject: Re: snippets and stored field in nutch...
To: [EMAIL PROTECTED]


The reason it is stored in the segments instead of the index is to allow
summarizers to be run on the content of hits to produce the summaries
that appear in the search results.  Summarizers are pluggable and the
actual content used to produce the summary can change.  And summaries
can be changed without re-fetching or re-indexing.  If a summary were
stored in the index, re-indexing would have to occur to make changes.
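
The pluggable-summarizer idea can be sketched in Python as follows. This is not Nutch's actual Summarizer plugin interface: `SEGMENT_CONTENT`, `leading_text`, `query_window`, and `summarize` are hypothetical names, and a dict stands in for segment storage. The point is only that the stored content stays put while the summarizer is swapped.

```python
from typing import Callable, Dict

# Stand-in for segment storage: raw content keyed by document id (hypothetical).
SEGMENT_CONTENT: Dict[str, str] = {
    "doc1": "Nutch is a distributed web search engine. It uses Lucene for indexing. "
            "Page contents live in segments.",
}

# A summarizer takes (content, query) and returns a snippet.
Summarizer = Callable[[str, str], str]


def leading_text(content: str, query: str, width: int = 40) -> str:
    """Summarizer A: the first `width` characters, ignoring the query."""
    return content[:width]


def query_window(content: str, query: str, width: int = 40) -> str:
    """Summarizer B: a window of text around the first match of the query."""
    pos = content.lower().find(query.lower())
    if pos < 0:
        return content[:width]
    start = max(0, pos - width // 2)
    return content[start:start + width]


def summarize(doc_id: str, query: str, summarizer: Summarizer) -> str:
    # The content is read at query time, so swapping the summarizer
    # needs no re-fetch and no re-index.
    return summarizer(SEGMENT_CONTENT[doc_id], query)


print(summarize("doc1", "lucene", leading_text))
print(summarize("doc1", "lucene", query_window))
```

Both calls read the same stored content; only the function passed in changes.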

Also the way the search process works, Nutch returns hits (basically
document ids).  These hits are then sorted and deduped and the best x
number (usually 10) returned. For only these 10 best hits, hitdetails
(fields in the index) and summaries are retrieved.  So there is
something to be said about the amount of data being pushed over the
network.
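
The two-phase flow described above can be sketched in Python like this. It is a simplification under stated assumptions, not Nutch code: `merge_hits` and `fetch_summaries` are hypothetical names, deduplication is done by document id only, and a dict plays the role of the content store.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id)


def merge_hits(shard_results: List[List[Hit]], top_n: int = 10) -> List[Hit]:
    """Phase 1: merge per-shard hits, dedupe by doc id, keep the best top_n."""
    best: Dict[str, float] = {}
    for hits in shard_results:
        for score, doc_id in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return heapq.nlargest(top_n, ((s, d) for d, s in best.items()))


def fetch_summaries(hits: List[Hit], content_store: Dict[str, str],
                    width: int = 30) -> Dict[str, str]:
    """Phase 2: touch stored content for the winning hits only."""
    return {doc_id: content_store[doc_id][:width] for _, doc_id in hits}


# Hypothetical results from two shards; "dup" appears on both.
shard_a = [(0.9, "a1"), (0.4, "a2"), (0.7, "dup")]
shard_b = [(0.8, "b1"), (0.6, "dup"), (0.1, "b2")]
content_store = {doc_id: f"contents of {doc_id} ..."
                 for doc_id in ("a1", "a2", "b1", "b2", "dup")}

top = merge_hits([shard_a, shard_b], top_n=3)
summaries = fetch_summaries(top, content_store)
print(top)
print(summaries)
```

Only ids and scores cross the network in phase 1; document contents are read for the final hits alone.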


Dennis Kubes

Ravish Bhagdev wrote:

Ah, I see, didn't know that, Thanks!

Interesting that nutch stores it in a different structure (segments)
and doesn't reuse Lucene strategy of storing within index.  Any
particular reason why?  Is there any other use of "Segments" data
structure except to return snippets?

Cheers,
Ravish

On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:

Hi Ravish.

You are correct that Nutch does not store document content in the
Lucene index. The content *is* stored in the Nutch segment, which is
where snippets come from.

Hope this helps.

-J


On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:

Hey All,

Am I right in believing that in Lucene/Nutch, to be able to return
content or snippet to a search query, the field to be returned has to
be stored?

AFAIK, by default, Nutch does not store the document field, am I
right?  If so, how does it manage to return snippets?  Wouldn't the
index be quite huge if nutch were storing document field by default?
I will appreciate any help/comments as I'm a bit lost with this.

Ravi
