Hi Mike,

Thanks for your reply :)

I am not an expert in either! I understand that Nutch does store the
contents, albeit in a separate data structure (which they call a segment,
as discussed in the thread). What I meant was that this seems like a much
more efficient way of presenting summaries or snippets (for apps that need
only these, of course) than using a stored field, which is the only option
in Solr. A stored field not only results in a huge index but, I would
guess, also slows down retrieval because of the increase in size (this is
admittedly a guess; I would like to know if that is not the case).

Also, for queries that only request ids/urls, the segments would never be
touched, even for the first n results...

Cheers,
Ravish

On 10/12/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> First, it should be noted that I am not an expert in Nutch's
> architecture. I do think I understand what is being said there, however.
>
> Nutch is a distributed web search engine, and uses Lucene as an
> indexing component. It is free to use external data structures to
> store data, and can store the index on a different machine than the
> contents are stored. They can be updated independently.
>
> One reason why this is more efficient is that in a distributed
> architecture, more documents are retrieved over the system than are
> eventually summarized and output. It makes no sense to shovel around
> the contents of all these documents if summaries are only being
> returned for the top 10 over the whole system.
>
> But Nutch is still storing the contents _somewhere_. They haven't
> found a magical technique that makes this need disappear.
>
> So, does an external store make sense for Solr? Well, unlike Nutch,
> Solr is a solitary unit. If you ask for 10 docs returned, with
> summaries, all of their contents are going to have to be retrieved.
> There aren't any advantages to storing the contents in a separate
> data structure (which will be the same size).
>
> Now, if you are using Solr in a large-scale distributed federated
> way, then you can replicate Nutch's strategy by storing the index in
> one Solr index, and the contents in another. This could also yield
> benefits in a single-machine context if your code accesses many more
> documents than it wants summarized.
>
> Keep in mind also that Solr has facilities to help you manage the
> size of the content store. Are you stripping your contents to their
> bare minima (removing HTML, etc)? Are you using a compressed text
> field (highly recommended for this kind of data)?
>
> Believe me, if I found that there was a way of providing summaries
> without storing doc contents, I would pee my pants with happiness and
> it would be in Solr faster than you can say "diaper".
>
> cheers,
> -Mike
>
> On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:
>
> > Hey guys,
> >
> > Check out this thread I opened on the Nutch mailing list. Looks like
> > Solr can benefit from reusing Nutch's "segment" based storage
> > strategy for efficiency in returning snippets, summaries etc without
> > using Lucene stored fields?
> >
> > Was this considered before?
> >
> > Ravish
> >
> > ---------- Forwarded message ----------
> > From: Dennis Kubes <[EMAIL PROTECTED]>
> > Date: Oct 11, 2007 11:27 PM
> > Subject: Re: snippets and stored field in nutch...
> > To: [EMAIL PROTECTED]
> >
> >
> > The reason it is stored in the segments instead of the index is to
> > allow summarizers to be run on the content of hits to produce the
> > summaries that appear in the search results. Summarizers are
> > pluggable and the actual content used to produce the summary can
> > change. And summaries can be changed without re-fetching or
> > re-indexing. If a summary were stored in the index, re-indexing
> > would have to occur to make changes.
> >
> > Also the way the search process works, Nutch returns hits (basically
> > document ids). These hits are then sorted and deduped and the best x
> > number (usually 10) returned.
> > For only these 10 best hits, hit details (fields in the index) and
> > summaries are retrieved. So there is something to be said about the
> > amount of data being pushed over the network.
> >
> > Dennis Kubes
> >
> > Ravish Bhagdev wrote:
> >> Ah, I see, didn't know that, Thanks!
> >>
> >> Interesting that Nutch stores it in a different structure (segments)
> >> and doesn't reuse Lucene's strategy of storing within the index. Any
> >> particular reason why? Is there any other use of the "Segments" data
> >> structure except to return snippets?
> >>
> >> Cheers,
> >> Ravish
> >>
> >> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
> >>> Hi Ravish.
> >>>
> >>> You are correct that Nutch does not store document content in the
> >>> Lucene index. The content *is* stored in the Nutch segment, which is
> >>> where snippets come from.
> >>>
> >>> Hope this helps.
> >>>
> >>> -J
> >>>
> >>>
> >>> On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
> >>>
> >>>> Hey All,
> >>>>
> >>>> Am I right in believing that in Lucene/Nutch, to be able to return
> >>>> content or a snippet for a search query, the field to be returned
> >>>> has to be stored?
> >>>>
> >>>> AFAIK, by default, Nutch does not store the document field, am I
> >>>> right? If so, how does it manage to return snippets? Wouldn't the
> >>>> index be quite huge if Nutch were storing the document field by
> >>>> default?
> >>>>
> >>>> I will appreciate any help/comments as I'm a bit lost with this.
> >>>>
> >>>> Ravi
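
For reference, the retrieval flow Dennis describes in the quoted thread
(collect hits as ids, sort and dedupe, keep the best 10, then fetch details
and summaries only for those) can be sketched roughly as follows. This is an
illustrative Python sketch only, not Nutch or Solr code. The names top_hits,
summarize, search and fetch_content are hypothetical, deduping by document id
is a simplification, and the summarizer is deliberately naive.

```python
from heapq import nlargest


def top_hits(shard_results, n=10):
    """Merge per-shard (doc_id, score) hits, dedupe by id, keep the best n.

    Assumption: deduping by doc_id and keeping the highest score is a
    simplification made for this sketch.
    """
    best = {}
    for hits in shard_results:
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return nlargest(n, best.items(), key=lambda kv: kv[1])


def summarize(text, query, width=40):
    """Naive stand-in for a pluggable summarizer: a window around the query."""
    pos = text.lower().find(query.lower())
    if pos < 0:
        return text[:width]
    start = max(0, pos - width // 2)
    return text[start:start + width]


def search(shard_results, fetch_content, query, n=10):
    """Phase 1 works on ids and scores only; phase 2 fetches content.

    fetch_content is a hypothetical callable standing in for the external
    content store (the "segment"); it is called only for the final n hits.
    """
    results = []
    for doc_id, score in top_hits(shard_results, n):
        text = fetch_content(doc_id)
        results.append((doc_id, score, summarize(text, query)))
    return results
```

The point of the split is that the content store is read at most n times per
query, however many hits the index returns, and a query that only wants
ids/urls can stop after top_hits without touching the content at all.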