expani opened a new issue, #14445: URL: https://github.com/apache/lucene/issues/14445
### Description

A simple term query, `searcher.search(new TermQuery(new Term(fieldName, fieldValue)), hits)`, on a field that is indexed only with `IndexOptions.DOCS` is slower from 9.12.0 onwards.

Root cause:

- The collector sets the minimum competitive score on the DISI once it has gathered the required number of hits. [TopScoreDocCollector sets it here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L173). This tells the DISI that the next time the scorer calls `nextDoc`, it should return a document whose score is >= minCompetitiveScore.
- After the collector sets minCompetitiveScore, ImpactsDISI does a [shallowAdvance via MaxScoreCache during nextDoc](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L58). This does not load the docIds into a buffer; it only moves the file pointers and updates other internal variables in the underlying ImpactsEnum. It tries to move to the block in the postings that contains the target (docId + 1) with a higher score. The shallowAdvance by `ImpactsSource` works as expected in all versions.
- The issue occurs when we [compute the max score at level zero after the shallowAdvance](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L59-L60). From 9.12.0 onwards, the `Impact` returned is always `DUMMY_IMPACTS` for fields with `IndexOptions.DOCS`. As a result, the maxScore calculated by [Similarity$SimScorer](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L72-L79) is always greater than the minimum competitive score, so all docIds are treated as competitive even though they aren't.

Possible solutions:

- Store the field with `IndexOptions.DOCS_AND_FREQS`, as this doesn't return `DUMMY_IMPACTS` but actually reads the impact value.
- If we don't care about the score and are using a TermQuery as a pure filter, wrapping the query in a ConstantScoreQuery resolves the issue. This works because [ConstantScoreScorer sets its delegate to empty](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConstantScoreScorer.java#L124) once the minimum competitive score is set. Thanks @msfroh for suggesting this alternative.
- Another way to resolve the issue is [returning a term frequency of 1](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java#L74) instead of `NO_MORE_DOCS` in the DummyImpacts returned by BlockPostingsEnum. Thanks @jpountz for suggesting this alternative. However, we would need to handle cases like [ExactPhraseMatcher, which depends on the frequency of dummy impacts](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ExactPhraseMatcher.java#L282-L284).
- I also tried a different approach: not returning `DUMMY_IMPACTS` for fields with `IndexOptions.DOCS`, since we already have the Impact information required.
<details>
<summary>Expand for Git Diff</summary>

```
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
index 1efcdf554dd..0a7c048ee95 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
@@ -422,7 +422,7 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
         Arrays.fill(freqBuffer, 1);
       }
-      if (needsFreq && needsImpacts) {
+      if (needsImpacts) {
         level0SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel0);
         level1SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel1);
         level0Impacts = new MutableImpactList(maxNumImpactsAtLevel0);
@@ -1107,9 +1107,6 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
     @Override
     public int getDocIdUpTo(int level) {
-      if (indexHasFreq == false) {
-        return NO_MORE_DOCS;
-      }
       if (level == 0) {
         return level0LastDocID;
       }
@@ -1118,14 +1115,12 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
     @Override
     public List<Impact> getImpacts(int level) {
-      if (indexHasFreq) {
         if (level == 0 && level0LastDocID != NO_MORE_DOCS) {
           return readImpacts(level0SerializedImpacts, level0Impacts);
         }
         if (level == 1) {
           return readImpacts(level1SerializedImpacts, level1Impacts);
         }
-      }
       return DUMMY_IMPACTS;
     }
```

</details>

I will follow up with a PR once we have decided between approaches 3 and 4.
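To make the root cause concrete, here is a minimal toy model of the block-skipping decision. It is not Lucene code: the `Impact` class, the `score` formula, and the helper names are hypothetical stand-ins, and it assumes every document in a `DOCS`-only field has frequency 1 and norm 1. It only shows why an impact with frequency `Integer.MAX_VALUE` (the value of `NO_MORE_DOCS`) keeps every block competitive, while an impact with frequency 1 lets every block be skipped once the collector raises the minimum competitive score just above the score of the hits it already holds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DummyImpactsSketch {

    /** Toy stand-in for a (freq, norm) impact pair; not Lucene's Impact class. */
    public static final class Impact {
        public final int freq;
        public final long norm;

        public Impact(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /** Hypothetical saturating score: grows with freq and never exceeds 1. */
    public static float score(int freq, long norm) {
        return (float) (freq / (freq + 1.2 * norm));
    }

    /** Upper bound on the score of any document in a block, given its impacts. */
    public static float maxScore(List<Impact> impacts) {
        float max = 0f;
        for (Impact impact : impacts) {
            max = Math.max(max, score(impact.freq, impact.norm));
        }
        return max;
    }

    /** Counts the blocks that cannot be skipped for the given threshold. */
    public static int competitiveBlocks(List<List<Impact>> blocks, float minCompetitiveScore) {
        int count = 0;
        for (List<Impact> block : blocks) {
            if (maxScore(block) >= minCompetitiveScore) {
                count++;
            }
        }
        return count;
    }

    /** Builds {@code count} blocks that all advertise the same single impact. */
    public static List<List<Impact>> blocksOf(Impact impact, int count) {
        List<List<Impact>> blocks = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            blocks.add(Collections.singletonList(impact));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumption: every hit scores the same (freq 1, norm 1), and the
        // threshold is the next float above that score once the queue is full.
        float minCompetitiveScore = Math.nextUp(score(1, 1L));

        // Dummy impact with freq = Integer.MAX_VALUE: the bound is always too high.
        List<List<Impact>> dummy = blocksOf(new Impact(Integer.MAX_VALUE, 1L), 8);
        // Impact with freq = 1: the bound matches what documents can really score.
        List<List<Impact>> freqOne = blocksOf(new Impact(1, 1L), 8);

        System.out.println(competitiveBlocks(dummy, minCompetitiveScore));   // prints 8
        System.out.println(competitiveBlocks(freqOne, minCompetitiveScore)); // prints 0
    }
}
```

In this model, approaches 3 and 4 both amount to making the advertised upper bound reflect a frequency of 1, which is what allows the remaining blocks to be skipped.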
### Version and environment details

Create a Big5 index with the OSB workload on OpenSearch 2.17 and on OpenSearch >= 2.18:

```
opensearch-benchmark execute-test --target-hosts http://127.0.0.1:9200 --workload big5 --client-options timeout:120 --workload-params="number_of_shards:1,bulk_indexing_clients:1,number_of_replicas:0" --kill-running-processes --include-tasks delete-index,create-index,check-cluster-health,index-append,refresh-after-index,force-merge,refresh-after-force-merge,wait-until-merges-finish
```

Run a term query as below:

```
curl -X POST "http://localhost:9200/big5/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "term": {
        "process.name": "kernel"
      }
    }
  }'
```

In OS 2.17 (Lucene 9.11.1) the query completes in <= 5ms, whereas in OS 2.18 (Lucene 9.12.0) it takes >= 200ms.