Re: Single call for distributed IDF?

Joel Bernstein Tue, 31 Jan 2017 09:47:59 -0800

I think I understand the process you describe in your blog, but I'm not sure
I would choose that approach. For some Streaming Expressions work I was
doing, I fetched the global IDF for the specific terms upfront at the
aggregator node. This took around 5-10 milliseconds in my tests, and I
implemented no caching at all, so the calls to the shards were made each
time. A simple cache could make it more efficient. Once you have the IDF at
the aggregator node, you could push the global IDF to the shards pretty
easily. Granted, this does involve another call to the shards, but the
overhead was so low that it seemed acceptable.
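
The aggregation step described above can be sketched roughly as follows.
This is a minimal illustration, not the Solr or ScoreNodesStream code: the
shape of the per-shard stats, the function names, and the sample numbers
are all assumptions, and the smoothed log formula is one common IDF variant
chosen for the example rather than the exact formula Solr applies.

```python
import math
from collections import Counter


def merge_shard_stats(shard_stats):
    """Sum document counts and per-term document frequencies across shards.

    Each entry is assumed (hypothetically) to look like
    {"num_docs": int, "df": {term: doc_freq}}.
    """
    total_docs = 0
    global_df = Counter()
    for stats in shard_stats:
        total_docs += stats["num_docs"]
        global_df.update(stats["df"])  # adds counts term by term
    return total_docs, global_df


def global_idf(total_docs, global_df):
    """Compute a smoothed IDF per term from the merged, collection-wide stats."""
    return {
        term: 1.0 + math.log((total_docs + 1) / (df + 1))
        for term, df in global_df.items()
    }


# Hypothetical stats returned by two shards for the query terms.
shard_stats = [
    {"num_docs": 1000, "df": {"solr": 100, "idf": 5}},
    {"num_docs": 3000, "df": {"solr": 50}},
]

total_docs, df = merge_shard_stats(shard_stats)
idf = global_idf(total_docs, df)
# The rarer term ("idf") ends up with the higher weight.
```

The resulting term-to-IDF map is what the aggregator would then push to the
shards in the extra call, or cache for later queries on the same terms.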


This is quite different from what you describe and also quite different
from the stats caching approach that is currently in Solr.

Maybe I'm just biased toward my own approach, but it seems simple and fast.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Jan 31, 2017 at 11:59 AM, Walter Underwood <wun...@wunderwood.org>
wrote:

> The usual reason to do a second call to get the stats for global IDF is to
> get around an Infoseek patent on the single call version. But that patent
> finally expired a couple of years ago, so now there is no reason to do a
> second call.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jan 24, 2017, at 11:01 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > Specifically, I’m talking about this:
> >
> >     <statsCache class="org.apache.solr.search.stats.LRUStatsCache"/>
> >
> > Adding that line increased our 95th percentile response time by 10
> > seconds.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Jan 24, 2017, at 10:43 AM, Joel Bernstein <joels...@gmail.com> wrote:
> >>
> >> Ah, I thought you were just interested in a fast way to get at IDF. This
> >> approach does take a callback but it's really fast.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Tue, Jan 24, 2017 at 1:39 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >>
> >>> I know how to do it. You return df for each term and num_docs, then
> >>> recalculate idf. I wrote up how we did it in Ultraseek XPA about ten
> >>> years ago, though with MonkeyRank instead of global IDF.
> >>>
> >>> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
> >>>
> >>> I was wondering why Solr makes a separate request to each shard for
> >>> that information instead of piggybacking it on the original request.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
> >>>> On Jan 24, 2017, at 10:34 AM, Joel Bernstein <joels...@gmail.com> wrote:
> >>>>
> >>>> This may help out:
> >>>> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/ScoreNodesStream.java#L208
> >>>>
> >>>> This points to some code that calculates global idf for a list of
> >>>> terms. Not sure if this matches your use case. It seems to be very
> >>>> fast.
> >>>>
> >>>> Joel Bernstein
> >>>> http://joelsolr.blogspot.com/
> >>>>
> >>>> On Tue, Jan 24, 2017 at 1:09 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >>>>
> >>>>> I tried running with the LRUStatsCache for global IDF, but the
> >>>>> performance penalty was pretty big. The 95th percentile response
> >>>>> time went from 3.4 seconds to 13 seconds. Oops.
> >>>>>
> >>>>> We should not need a separate call to get the tf and df stats.
> >>>>> Those are already calculated when doing the first request. I worked
> >>>>> on a search engine that did it that way twenty years ago.
> >>>>>
> >>>>> In the past, there would have been an IP obstacle, but I think
> >>>>> that is resolved.
> >>>>>
> >>>>> wunder
> >>>>> Walter Underwood
> >>>>> wun...@wunderwood.org
> >>>>> http://observer.wunderwood.org/  (my blog)
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >
>
>
