Re: Single call for distributed IDF?

Walter Underwood Tue, 31 Jan 2017 09:00:01 -0800

The usual reason to do a second call to get the stats for global IDF is to get 
around an Infoseek patent on the single call version. But that patent finally 
expired a couple of years ago, so now there is no reason to do a second call.
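
A minimal sketch of the single-call idea, assuming each shard returns its document count and per-term df along with its results. The response layout, function names, and the BM25-style IDF formula are illustrative assumptions, not Solr's actual API or scoring code.

```python
import math
from collections import Counter


def merge_shard_stats(shard_responses):
    """Sum num_docs and per-term document frequencies across shards."""
    total_docs = 0
    global_df = Counter()
    for resp in shard_responses:
        total_docs += resp["num_docs"]
        # Counter.update adds counts term by term.
        global_df.update(resp["df"])
    return total_docs, global_df


def global_idf(total_docs, df):
    """BM25-style IDF, used here only for illustration."""
    return math.log(1 + (total_docs - df + 0.5) / (df + 0.5))


# Hypothetical stats piggybacked on two shard responses.
shards = [
    {"num_docs": 1000, "df": {"solr": 10, "idf": 2}},
    {"num_docs": 3000, "df": {"solr": 90}},
]

total_docs, df = merge_shard_stats(shards)
idf = {term: global_idf(total_docs, n) for term, n in df.items()}
```

The coordinator can then rescore the merged hits with the global values, so no separate stats request is needed.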


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 24, 2017, at 11:01 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> Specifically, I’m talking about this:
> 
>     <statsCache class="org.apache.solr.search.stats.LRUStatsCache"/>
> 
> Adding that line increased our 95th percentile response time by 10 seconds.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jan 24, 2017, at 10:43 AM, Joel Bernstein <joels...@gmail.com> wrote:
>> 
>> Ah, I thought you were just interested in a fast way to get at IDF. This
>> approach does take a callback but it's really fast.
>> 
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>> 
>> On Tue, Jan 24, 2017 at 1:39 PM, Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> 
>>> I know how to do it. You return df for each term and num_docs then
>>> recalculate idf. I wrote up how we did it in Ultraseek XPA about ten years
>>> ago, though with MonkeyRank instead of global IDF.
>>> 
>>> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
>>> 
>>> I was wondering why Solr makes a separate request to each shard for that
>>> information instead of piggybacking it on the original request.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jan 24, 2017, at 10:34 AM, Joel Bernstein <joels...@gmail.com> wrote:
>>>> 
>>>> This may help out:
>>>> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/ScoreNodesStream.java#L208
>>>> 
>>>> This points to some code that calculates global idf for a list of terms.
>>>> Not sure if this matches your use case. It seems to be very fast.
>>>> 
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/
>>>> 
>>>> On Tue, Jan 24, 2017 at 1:09 PM, Walter Underwood <wun...@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> I tried running with the LRUStatsCache for global IDF, but the performance
>>>>> penalty was pretty big. The 95th percentile response time went from 3.4
>>>>> seconds to 13 seconds. Oops.
>>>>> 
>>>>> We should not need a separate call to get the tf and df stats. Those are
>>>>> already calculated when doing the first request. I worked on a search
>>>>> engine that did it that way twenty years ago.
>>>>> 
>>>>> In the past, there would have been an IP obstacle, but I think that is
>>>>> resolved.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>
