[ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173187#comment-17173187 ]
Vincenzo D'Amore commented on LUCENE-9107:
------------------------------------------

Hi, I went a step further in trying to identify the performance difference when using CommonTermsQuery with different versions of Solr (7.3.1 vs 8.6.0).

I tried the ann sample in the test_8.6.0 branch of this fork of the anserini repo:
[https://github.com/freedev/anserini/blob/test_8.6.0]

Here are the steps to reproduce the problem.

Clone and build:
{quote}{{git clone [https://github.com/freedev/anserini.git]}}
{{cd anserini}}
{{git checkout test_8.6.0}}
{{mvn -Prelease clean package}}
{quote}
Create the Lucene index:
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar io.anserini.ann.IndexVectors -input glove.6B.300d.txt -path glove300-idx-8.6.0 -encoding fw}}
{quote}
Reproduce the issue (the vector used for the word apple is hardcoded into the ApproximateNearestNeighborSearch main):
{quote}{{java -cp target/anserini-0.9.5-SNAPSHOT-fatjar.jar io.anserini.ann.ApproximateNearestNeighborSearch -input glove.6B.300d.txt -path glove300-idx-8.6.0 -encoding fw -word apple}}
{quote}
This is the VisualVM Sampler output after monitoring {{ApproximateNearestNeighborSearch}} with Java Flight Recorder:

!image-2020-08-07-16-54-27-905.png|width=921,height=609!

Changing [line 186 in ApproximateNearestNeighborSearch|https://github.com/freedev/anserini/blob/test_8.6.0/src/main/java/io/anserini/ann/ApproximateNearestNeighborSearch.java#L186] from:

{{TopScoreDocCollector.create(indexArgs.depth, 0);}}

to:

{{TopScoreDocCollector.create(indexArgs.depth, Integer.MAX_VALUE);}}

greatly reduces the time spent (from ~2 sec to 300-400 milliseconds), see the screenshot:

!Screenshot 2020-08-07 at 16.20.05.png|width=927,height=613!

> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9107
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.3
>            Reporter: Tommaso Teofili
>            Priority: Major
>         Attachments: Screenshot 2020-08-07 at 16.20.01.png, Screenshot
> 2020-08-07 at 16.20.05.png, image-2020-08-07-16-54-27-905.png
>
>
> In [1] a {{CommonTermsQuery}} is used in order to perform a query with lots
> of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low
> frequency terms, the query, although big, finishes in around 2-300ms with
> Lucene 7.6.0.
> However, when upgrading the code to Lucene 8.x, the query runs in 2-3s
> instead [2].
> After digging a bit into it it seems that the regression in speed comes from
> the fact that top-k scoring introduced by default in version 8 is causing
> that, not sure "where" exactly in the code though.
> When switching back to complete hit scoring [3], the speed goes back to the
> initial 2-300ms also in Lucene 8.3.x.
> It'd be nice to understand the reason why this is happening and if it is only
> concerning {{CommonTermsQuery}} or affecting {{BooleanQuery}} as well.
> If this is a case that depends on the data and application involved (Anserini
> in this case), the application should handle it, otherwise if it is a
> regression/bug in Lucene it'd be nice to fix it.
> [1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
> [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
> [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174