Re: solr, snippets and stored field in nutch...

Ravish Bhagdev Thu, 11 Oct 2007 16:35:20 -0700

Hi Mike,

Thanks for your reply :)


I am not an expert in either! I understand that Nutch does store the
contents, albeit in a separate data structure (which they call a
segment, as discussed in the thread). What I meant was that this seems
like a much more efficient way of presenting summaries or snippets
(for apps that need only these, of course) than using a stored field,
which is the only option in Solr. A stored field not only results in a
huge index but may also reduce retrieval speed because of that
increase in size (this is admittedly a guess; I would like to know if
that is not the case).  Also, for queries requesting only ids/urls,
the segments would never be touched, even for the first n results...

Cheers.
Ravish

On 10/12/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> First, it should be noted that I am not an expert in Nutch's
> architecture.  I do think I understand what is being said there, however.
>
> Nutch is a distributed web search engine, and uses Lucene as an
> indexing component.  It is free to use external data structures to
> store data, and can store the index on a different machine than the
> contents are stored.  They can be updated independently.
>
> One reason why this is more efficient is that in a distributed
> architecture, more documents are retrieved over the system than are
> eventually summarized and output.  It makes no sense to shovel around
> the contents of all these documents if summaries are only being
> returned for the top 10 over the whole system.
>
> But Nutch is still storing the contents _somewhere_.  They haven't
> found a magical technique that makes this need disappear.
>
> So, does an external store make sense for Solr? Well, unlike Nutch,
> Solr is a solitary unit.  If you ask for 10 docs returned, with
> summaries, all of their contents are going to have to be retrieved.
> There aren't any advantages to storing the contents in a separate
> data structure (which will be the same size).
>
> Now, if you are using Solr in a large-scale distributed federated
> way, then you can replicate Nutch's strategy by storing the index in
> one Solr index, and the contents in another.  This could also yield
> benefits in a single-machine context if your code accesses many more
> documents than it wants summarized.
>
> Keep in mind also that Solr has facilities to help you manage the
> size of the content store.  Are you stripping your contents to their
> bare minima (removing HTML, etc)?  Are you using a compressed text
> field (highly recommended for this kind of data)?
>
> Believe me, if I found that there was a way of providing summaries
> without storing doc contents, I would pee my pants with happiness and
> it would be in Solr faster than you can say "diaper".
>
> cheers,
> -Mike
>
> On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:
>
> > Hey guys,
> >
> > Check out this thread I opened on the Nutch mailing list.  Looks like Solr
> > can benefit from reusing Nutch's "segment" based storage strategy for
> > efficiency in returning snippets, summaries etc without using Lucene
> > stored fields?
> >
> > Was this considered before?
> >
> > Ravish
> >
> > ---------- Forwarded message ----------
> > From: Dennis Kubes <[EMAIL PROTECTED]>
> > Date: Oct 11, 2007 11:27 PM
> > Subject: Re: snippets and stored field in nutch...
> > To: [EMAIL PROTECTED]
> >
> >
> > The reason it is stored in the segments instead of the index is to allow
> > summarizers to be run on the content of hits to produce the summaries
> > that appear in the search results.  Summarizers are pluggable and the
> > actual content used to produce the summary can change.  And summaries
> > can be changed without re-fetching or re-indexing.  If a summary were
> > stored in the index, re-indexing would have to occur to make changes.
> >
> > Also the way the search process works, Nutch returns hits (basically
> > document ids).  These hits are then sorted and deduped and the best x
> > number (usually 10) returned.  For only these 10 best hits, hit
> > details (fields in the index) and summaries are retrieved.  So there is
> > something to be said about the amount of data being pushed over the
> > network.
> >
> > Dennis Kubes
> >
> > Ravish Bhagdev wrote:
> >> Ah, I see, didn't know that, Thanks!
> >>
> >> Interesting that Nutch stores it in a different structure (segments)
> >> and doesn't reuse Lucene's strategy of storing within the index.  Any
> >> particular reason why?  Is there any other use of "Segments" data
> >> structure except to return snippets?
> >>
> >> Cheers,
> >> Ravish
> >>
> >> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
> >>> Hi Ravish.
> >>>
> >>> You are correct that Nutch does not store document content in the
> >>> Lucene index. The content *is* stored in the Nutch segment, which is
> >>> where snippets come from.
> >>>
> >>> Hope this helps.
> >>>
> >>> -J
> >>>
> >>>
> >>> On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
> >>>
> >>>> Hey All,
> >>>>
> >>>> Am I right in believing that in Lucene/Nutch, to be able to return
> >>>> content or a snippet for a search query, the field to be returned
> >>>> has to be stored?
> >>>>
> >>>> AFAIK, by default, Nutch does not store the document field, am I
> >>>> right?  If so, how does it manage to return snippets?  Wouldn't the
> >>>> index be quite huge if Nutch were storing the document field by
> >>>> default?
> >>>>
> >>>> I will appreciate any help/comments as I'm a bit lost with this.
> >>>>
> >>>> Ravi
> >>>
>
>
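
The two-phase retrieval discussed in this thread (search returns only ids and scores, and contents are fetched from a separate store for just the best few hits) can be illustrated with a small Python sketch. This is a toy model, not Nutch or Solr code: `TermIndex`, `ContentStore`, `first_match_snippet`, and `search_with_snippets` are hypothetical names invented for illustration, and the scoring is plain term frequency. The `summarize` parameter mirrors the point that summarizers are pluggable and can change without re-indexing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Hit:
    doc_id: str
    score: float


class TermIndex:
    """Toy inverted index: holds terms and counts only, never document text."""

    def __init__(self) -> None:
        self.postings: Dict[str, Dict[str, int]] = {}

    def add(self, doc_id: str, text: str) -> None:
        for term in text.lower().split():
            counts = self.postings.setdefault(term, {})
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term: str) -> List[Hit]:
        counts = self.postings.get(term.lower(), {})
        hits = [Hit(doc_id, float(tf)) for doc_id, tf in counts.items()]
        # Highest term frequency first; ties broken by id for stable output.
        return sorted(hits, key=lambda h: (-h.score, h.doc_id))


class ContentStore:
    """Stand-in for an external content store (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.reads = 0  # counts how often full contents are fetched

    def put(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text

    def get(self, doc_id: str) -> str:
        self.reads += 1
        return self.docs[doc_id]


def first_match_snippet(text: str, term: str, width: int = 40) -> str:
    """Naive summarizer: a fixed-width window around the first match."""
    pos = text.lower().find(term.lower())
    start = max(0, pos - width // 2) if pos >= 0 else 0
    return text[start:start + width]


def search_with_snippets(
    index: TermIndex,
    store: ContentStore,
    term: str,
    top_n: int = 10,
    summarize: Callable[[str, str], str] = first_match_snippet,
) -> List[Tuple[str, str]]:
    # Phase 1: rank using the index alone; no contents are read here.
    best = index.search(term)[:top_n]
    # Phase 2: fetch and summarize contents for the top_n hits only.
    return [(h.doc_id, summarize(store.get(h.doc_id), term)) for h in best]
```

In this sketch, a query that only needs ids calls `index.search` and never touches the store, while `search_with_snippets` reads exactly `top_n` documents regardless of how many matched. The contents are still stored somewhere, as Mike points out; the split only changes which reads happen and when.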
