[GitHub] [lucene] jpountz commented on a diff in pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

via GitHub Mon, 06 Feb 2023 08:04:11 -0800


jpountz commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1097579916



##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -380,21 +431,28 @@ public ScorerSupplier scorerSupplier(LeafReaderContext 
context) throws IOExcepti
         // cost estimates.
         final long cost;
         final long queryTermsCount = termData.size();
-        long potentialExtraCost = indexTerms.getSumDocFreq();
+        final long sumDocFreq = indexTerms.getSumDocFreq();
+        long potentialExtraCost = sumDocFreq;
         final long indexedTermCount = indexTerms.size();
         if (indexedTermCount != -1) {
           potentialExtraCost -= indexedTermCount;
         }
         cost = queryTermsCount + potentialExtraCost;
 
+        final boolean isPrimaryKeyField = indexedTermCount != -1 && sumDocFreq 
== indexedTermCount;

Review Comment:
   Since `terms.size()` is an optional index statistic, maybe check `sumDocFreq 
== docCount` instead?



##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -258,13 +271,41 @@ public Matches matches(LeafReaderContext context, int 
doc) throws IOException {
        * On the given leaf context, try to either rewrite to a disjunction if 
there are few matching
        * terms, or build a bitset containing matching docs.
        */
-      private WeightOrDocIdSet rewrite(LeafReaderContext context) throws 
IOException {
+      private WeightOrDocIdSet rewrite(
+          LeafReaderContext context, long leadCost, boolean isPrimaryKeyField, 
DocValuesType dvType)
+          throws IOException {
         final LeafReader reader = context.reader();
 
         Terms terms = reader.terms(field);
         if (terms == null) {
           return null;
         }
+
+        long costThreshold = Long.MAX_VALUE;
+        if (dvType == DocValuesType.SORTED || dvType == 
DocValuesType.SORTED_SET) {
+          // Establish a threshold for switching to doc values. Give postings 
a significant
+          // advantage for the primary-key case, since many of the primary-key 
terms may not
+          // actually be in this segment. The 8x factor is arbitrary, based on 
IndexOrDVQuery,
+          // but has performed well in benchmarks:
+          costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;
+
+          if (termData.size() > costThreshold) {
+            // If the number of terms is > the number of candidates, DV should 
perform better.

Review Comment:
   I wonder if this is right given that the doc-values query still eagerly 
evaluates all terms against the terms dictionary. For this to work correctly, 
we'd need a query that looks up terms lazily rather than eagerly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] jpountz commented on a diff in pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Reply via email to