[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring

Vincenzo D'Amore (Jira) Fri, 07 Aug 2020 08:00:25 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173187#comment-17173187
 ]


Vincenzo D'Amore commented on LUCENE-9107:
------------------------------------------

Hi, I went a step further in trying to identify the performance difference 
when using CommonTermsQuery with different versions of Solr (7.3.1 vs 
8.6.0).

The code is in the test_8.6.0 branch of this fork of the anserini repo: 
[https://github.com/freedev/anserini/blob/test_8.6.0]

There I tried the ANN sample. Here are the steps to reproduce the problem:
 clone and build
{quote}{{git clone [https://github.com/freedev/anserini.git]}}
 {{cd anserini}}
 {{git checkout test_8.6.0}}
 {{mvn -Prelease clean package}}
{quote}
create the lucene index
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar 
io.anserini.ann.IndexVectors -input glove.6B.300d.txt -path glove300-idx-8.6.0 
-encoding fw}}
{quote}
reproduce the issue (the vector used for the word apple is hardcoded into the 
ApproximateNearestNeighborSearch main)
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar 
io.anserini.ann.ApproximateNearestNeighborSearch -input glove.6B.300d.txt -path 
glove300-idx-8.6.0 -encoding fw -word apple}}
{quote}
 

This is the VisualVM Sampler output after monitoring 
{{ApproximateNearestNeighborSearch}} with Java Flight Recorder:

!image-2020-08-07-16-54-27-905.png|width=921,height=609!

Changing [line 186 in 
ApproximateNearestNeighborSearch|https://github.com/freedev/anserini/blob/test_8.6.0/src/main/java/io/anserini/ann/ApproximateNearestNeighborSearch.java#L186]

from:

{{TopScoreDocCollector.create(indexArgs.depth, 0);}}

to:

{{TopScoreDocCollector.create(indexArgs.depth, Integer.MAX_VALUE);}}

greatly reduces the time spent (from ~2 s to 300-400 ms); see the 
screenshot:

 

!Screenshot 2020-08-07 at 16.20.05.png|width=927,height=613!
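
For readers unfamiliar with the second argument of {{TopScoreDocCollector.create}}: in Lucene 8 it is {{totalHitsThreshold}}, the number of hits that are counted accurately. Once more hits than that have been collected, the collector may allow non-competitive documents to be skipped; {{Integer.MAX_VALUE}} disables skipping, so every hit is scored. Below is a minimal toy model of that contract in plain Java. It is not Lucene code: the class {{TotalHitsThresholdSketch}}, its {{search}} method and the sample scores are hypothetical, and real Lucene decides what to skip from per-block score upper bounds rather than from the actual score. The sketch only shows what the parameter controls; it does not explain why the skipping mode is the slower one for this query.

```java
import java.util.PriorityQueue;

// Toy model (NOT Lucene code) of what totalHitsThreshold controls in a top-k collector.
public class TotalHitsThresholdSketch {

    /** Outcome of one simulated search. */
    static final class Outcome {
        final int collected;  // hits that were actually collected (scored)
        final boolean exact;  // false once at least one hit has been skipped
        final float best;     // highest score seen among the collected hits

        Outcome(int collected, boolean exact, float best) {
            this.collected = collected;
            this.exact = exact;
            this.best = best;
        }
    }

    /**
     * Simulates collecting the top numHits scores.
     * Assumption: the score itself stands in for the upper bound that a real
     * engine would use to decide whether a document can be skipped.
     */
    static Outcome search(float[] scores, int numHits, int totalHitsThreshold) {
        // Min-heap: the head is the weakest entry of the current top k.
        PriorityQueue<Float> topK = new PriorityQueue<>();
        int collected = 0;
        boolean exact = true;
        for (float score : scores) {
            // Skipping is allowed only after more than totalHitsThreshold hits
            // have been collected and the top-k heap is full.
            boolean maySkip = collected > totalHitsThreshold && topK.size() == numHits;
            if (maySkip && score <= topK.peek()) {
                exact = false; // the hit count is now only a lower bound
                continue;
            }
            collected++;
            if (topK.size() < numHits) {
                topK.add(score);
            } else if (score > topK.peek()) {
                topK.poll();
                topK.add(score);
            }
        }
        float best = 0f;
        for (float s : topK) {
            best = Math.max(best, s);
        }
        return new Outcome(collected, exact, best);
    }

    public static void main(String[] args) {
        float[] scores = {5f, 1f, 4f, 2f, 3f, 1f, 6f, 2f};
        Outcome topScores = search(scores, 2, 0);
        Outcome complete = search(scores, 2, Integer.MAX_VALUE);
        System.out.println("threshold=0:         collected=" + topScores.collected
                + " exact=" + topScores.exact);
        System.out.println("threshold=MAX_VALUE: collected=" + complete.collected
                + " exact=" + complete.exact);
    }
}
```

With the sample scores, the threshold of 0 collects 4 of the 8 hits and reports an inexact count, while {{Integer.MAX_VALUE}} collects all 8 with an exact count. Both runs return the same best score, which is the point of the mode: only the bookkeeping differs, and the cost of that bookkeeping is what the profiles above compare.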

> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9107
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.3
>            Reporter: Tommaso Teofili
>            Priority: Major
>         Attachments: Screenshot 2020-08-07 at 16.20.01.png, Screenshot 
> 2020-08-07 at 16.20.05.png, image-2020-08-07-16-54-27-905.png
>
>
> In [1] a {{CommonTermsQuery}} is used in order to perform a query with lots 
> of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low 
> frequency terms, the query, although big, finishes in around 200-300 ms with 
> Lucene 7.6.0. 
> However, when upgrading the code to Lucene 8.x, the query runs in 2-3 s 
> instead [2].
> After digging into it a bit, it seems that the speed regression is caused by 
> the top-k scoring introduced by default in version 8, although it is not 
> clear where exactly in the code.
> When switching back to complete hit scoring [3], the speed goes back to the 
> initial 200-300 ms, also in Lucene 8.3.x.
> It'd be nice to understand the reason why this is happening and if it is only 
> concerning {{CommonTermsQuery}} or affecting {{BooleanQuery}} as well.
> If this is a case that depends on the data and application involved (Anserini 
> in this case), the application should handle it, otherwise if it is a 
> regression/bug in Lucene it'd be nice to fix it.
> [1] : 
> https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
> [2] : 
> https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
> [3] : 
> https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
