[I] Term Query are slower post Lucene 9.12 for fields with IndexOptions.DOCS [lucene]

via GitHub Sun, 06 Apr 2025 21:54:22 -0700


expani opened a new issue, #14445:
URL: https://github.com/apache/lucene/issues/14445


   ### Description
   
   
   A simple term query, `searcher.search(new TermQuery(new Term(fieldName, fieldValue)), hits)`, on a field that is indexed only with `IndexOptions.DOCS` is slower from 9.12.0 onwards.
   
   Root cause:
   
   - The collector sets the minimum competitive score on the DISI after it has gathered the required hits. [TopScoreDocCollector sets it here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L173).
   This tells the DISI that when the scorer calls `nextDoc` again, it should return a document with a score >= minCompetitiveScore.
   
   - After the collector sets minCompetitiveScore, ImpactsDISI does a [shallowAdvance via MaxScoreCache during nextDoc](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L58). A shallowAdvance doesn't load the docIds into a buffer; it only moves the file pointers and updates other internal variables in the underlying ImpactsEnum. It tries to move to the block in the postings that will contain the target (docId + 1) with a higher score.
   The shallowAdvance by `ImpactsSource` works as expected in all versions.
   
   - The issue occurs when we [compute the max score at level zero after the shallowAdvance](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L59-L60). From 9.12.0, the `Impact` returned is always `DUMMY_IMPACTS` for fields with `IndexOptions.DOCS`. This causes the maxScore calculated by [Similarity$SimScorer](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MaxScoreCache.java#L72-L79) to always be greater than the minimum competitive score, so all docIds are treated as competitive even though they aren't.
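The effect of the dummy impact on block skipping can be illustrated with a small standalone sketch. It does not use Lucene classes: `DummyImpactsSketch`, `score`, and `canSkipBlock` are hypothetical names, the scoring formula is a simplified BM25 term score with length normalization ignored, and the `idf` value is made up. It assumes every document in a DOCS-only field has a term frequency of 1 and that the collector raises the threshold to just above the score it has already seen.

```java
class DummyImpactsSketch {
    static final float K1 = 1.2f; // BM25 k1 default

    // Simplified BM25 term score; length normalization is ignored (assumption of this sketch).
    static float score(float idf, int freq) {
        return idf * freq / (freq + K1);
    }

    // A block may be skipped only when its score upper bound is below the threshold.
    static boolean canSkipBlock(float maxScore, float minCompetitiveScore) {
        return maxScore < minCompetitiveScore;
    }

    public static void main(String[] args) {
        float idf = 2.0f; // hypothetical value, for illustration only

        // Every document in a DOCS-only field scores as if freq == 1.
        float actualScore = score(idf, 1);
        // Once enough hits are collected, only strictly better scores are competitive.
        float minCompetitiveScore = Math.nextUp(actualScore);

        // Upper bound derived from a real impact (freq = 1).
        float boundFromRealImpact = score(idf, 1);
        // Upper bound derived from a dummy impact with a huge frequency.
        float boundFromDummyImpact = score(idf, Integer.MAX_VALUE);

        System.out.println("real impact bound skippable: "
            + canSkipBlock(boundFromRealImpact, minCompetitiveScore));  // true
        System.out.println("dummy impact bound skippable: "
            + canSkipBlock(boundFromDummyImpact, minCompetitiveScore)); // false
    }
}
```

With the dummy impact the upper bound saturates near `idf`, so it stays above the threshold and no block is ever skipped, which matches the behaviour described above.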
   
   Possible solutions:
   
   - Index the field with `IndexOptions.DOCS_AND_FREQS`, since this doesn't return `DUMMY_IMPACTS` but actually reads the impact value.
   
   - If we don't care about the score and are using a TermQuery as a pure filter, then wrapping the query with a ConstantScoreQuery resolves the issue. This works because [ConstantScoreScorer sets its delegate as empty](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ConstantScoreScorer.java#L124) once the minimum competitive score is set. Thanks @msfroh for suggesting this alternative.
   
   - Another way of resolving the issue is [returning a term frequency of 1](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java#L74) instead of `NO_MORE_DOCS` in the DummyImpacts returned by BlockPostingsEnum. Thanks @jpountz for suggesting this alternative.
   However, we would need to handle cases like [ExactPhraseMatcher, which depends on the frequency of dummy impacts](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/ExactPhraseMatcher.java#L282-L284).
   
   - I also tried a different approach of not returning `DUMMY_IMPACTS` for 
fields with `IndexOptions.DOCS` since we already have the Impact information 
required. 
   
   <details>
   
   <summary>Expand for Git Diff</summary>
   
   ```
   diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
   index 1efcdf554dd..0a7c048ee95 100644
   --- a/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
   +++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java
   @@ -422,7 +422,7 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
            Arrays.fill(freqBuffer, 1);
          }
   
   -      if (needsFreq && needsImpacts) {
   +      if (needsImpacts) {
            level0SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel0);
            level1SerializedImpacts = new BytesRef(maxImpactNumBytesAtLevel1);
            level0Impacts = new MutableImpactList(maxNumImpactsAtLevel0);
   @@ -1107,9 +1107,6 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
   
              @Override
              public int getDocIdUpTo(int level) {
   -            if (indexHasFreq == false) {
   -              return NO_MORE_DOCS;
   -            }
                if (level == 0) {
                  return level0LastDocID;
                }
   @@ -1118,14 +1115,12 @@ public final class Lucene101PostingsReader extends PostingsReaderBase {
   
              @Override
              public List<Impact> getImpacts(int level) {
   -            if (indexHasFreq) {
                  if (level == 0 && level0LastDocID != NO_MORE_DOCS) {
                    return readImpacts(level0SerializedImpacts, level0Impacts);
                  }
                  if (level == 1) {
                    return readImpacts(level1SerializedImpacts, level1Impacts);
                  }
   -            }
                return DUMMY_IMPACTS;
              }
   ```
   
   </details>
   
   I will follow up with a PR after we discuss approaches 3 and 4.
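To make the ConstantScoreQuery workaround from the list above more concrete, here is a minimal standalone model of the mechanism. It is a sketch, not Lucene's implementation: `ConstantScoreSketch`, `ConstantScorer`, and `visitedDocs` are hypothetical names, and the collector is simplified to raise the threshold as soon as `k` hits are gathered.

```java
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

class ConstantScoreSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a scorer whose hits all share one score. */
    static final class ConstantScorer {
        private final float score;
        private PrimitiveIterator.OfInt delegate;

        ConstantScorer(float score, int[] matchingDocs) {
            this.score = score;
            this.delegate = IntStream.of(matchingDocs).iterator();
        }

        int nextDoc() {
            return delegate.hasNext() ? delegate.nextInt() : NO_MORE_DOCS;
        }

        // No remaining document can beat the constant score, so stop iterating.
        void setMinCompetitiveScore(float minScore) {
            if (minScore > score) {
                delegate = IntStream.empty().iterator();
            }
        }
    }

    /** Simplified collector loop: returns how many documents were visited. */
    static int visitedDocs(ConstantScorer scorer, int k) {
        int visited = 0;
        while (scorer.nextDoc() != NO_MORE_DOCS) {
            visited++;
            if (visited == k) {
                // After k hits, only strictly better scores are competitive.
                scorer.setMinCompetitiveScore(Math.nextUp(scorer.score));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        int[] docs = IntStream.range(0, 1_000_000).toArray();
        System.out.println(visitedDocs(new ConstantScorer(1.0f, docs), 10)); // prints 10
    }
}
```

In application code, the workaround itself amounts to wrapping the query, for example `new ConstantScoreQuery(new TermQuery(new Term(fieldName, fieldValue)))`.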
   
   
   ### Version and environment details
   
   Create a Big5 index with the OpenSearch Benchmark (OSB) `big5` workload, once on OpenSearch (OS) 2.17 and once on OS >= 2.18:
   
   ```
   opensearch-benchmark execute-test --target-hosts http://127.0.0.1:9200 \
     --workload big5 --client-options timeout:120 \
     --workload-params="number_of_shards:1,bulk_indexing_clients:1,number_of_replicas:0" \
     --kill-running-processes \
     --include-tasks delete-index,create-index,check-cluster-health,index-append,refresh-after-index,force-merge,refresh-after-force-merge,wait-until-merges-finish
   ```
   
   Run a term query as below:
   
   ```
   curl -X POST "http://localhost:9200/big5/_search" \
   -H "Content-Type: application/json" \
   -d '{
     "query": {
       "term": {
         "process.name": "kernel"
       }
     }
   }'
   ```
   
   In OS 2.17 (Lucene 9.11.1) the query completes in <= 5ms, whereas in OS 2.18 (Lucene 9.12.0) it takes >= 200ms.
   
   

