[GitHub] [lucene] benwtrent commented on a diff in pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.

via GitHub Thu, 21 Sep 2023 07:08:56 -0700


benwtrent commented on code in PR #12382:
URL: https://github.com/apache/lucene/pull/12382#discussion_r1333097677



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.search.Weight.DefaultBulkScorer;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.MathUtil;
+
+/**
+ * BulkScorer implementation of {@link BlockMaxConjunctionScorer} that focuses on top-level
+ * conjunctions over clauses that do not have two-phase iterators. Use a {@link DefaultBulkScorer}
+ * around a {@link BlockMaxConjunctionScorer} if you need two-phase support.
+ * Another difference with {@link BlockMaxConjunctionScorer} is that this scorer computes scores on
+ * the fly in order to be able to skip evaluating more clauses if the total score would be under the
+ * minimum competitive score anyway. This generally works well because computing a score is cheaper
+ * than
+ */
+final class BlockMaxConjunctionBulkScorer extends BulkScorer {
+
+  private final Scorer[] scorers;
+  private final DocIdSetIterator[] iterators;
+  private final DocIdSetIterator lead;
+  private final DocAndScore scorable = new DocAndScore();
+  private final double[] sumOfOtherClauses;
+
+  BlockMaxConjunctionBulkScorer(List<Scorer> scorers) throws IOException {
+    if (scorers.size() <= 1) {
+      throw new IllegalArgumentException("Expected 2 or more scorers, got " + scorers.size());
+    }
+    this.scorers = scorers.toArray(Scorer[]::new);
+    Arrays.sort(this.scorers, Comparator.comparingLong(scorer -> scorer.iterator().cost()));
+    this.iterators =
+        Arrays.stream(this.scorers).map(Scorer::iterator).toArray(DocIdSetIterator[]::new);
+    lead = iterators[0];
+    this.sumOfOtherClauses = new double[this.scorers.length];
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    collector.setScorer(scorable);
+
+    int windowMin = Math.max(lead.docID(), min);
+    while (windowMin < max) {
+      // Use impacts of the least costly scorer to compute windows
+      // NOTE: windowMax is inclusive

Review Comment:
   Just to clarify my thinking: "least costly" suggests to me "matches the fewest docs". Is that the correct intuition?
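
To make the question concrete, here is a minimal sketch of the ordering the constructor sets up. It assumes, as the `DocIdSetIterator#cost()` javadoc describes, that cost is only an estimate (generally an upper bound) of the number of matching docs. `LeadClauseSketch` and `Clause` are hypothetical stand-ins, not Lucene types.

```java
import java.util.Arrays;
import java.util.Comparator;

final class LeadClauseSketch {
  // Hypothetical stand-in for a Scorer: a term plus its estimated cost.
  static final class Clause {
    final String term;
    final long cost; // assumed to approximate the number of matching docs

    Clause(String term, long cost) {
      this.term = term;
      this.cost = cost;
    }
  }

  // Same ordering as the constructor in the diff: ascending cost.
  static Clause[] sortByCost(Clause... clauses) {
    Clause[] sorted = clauses.clone();
    Arrays.sort(sorted, Comparator.comparingLong((Clause c) -> c.cost));
    return sorted;
  }

  public static void main(String[] args) {
    Clause[] sorted =
        sortByCost(new Clause("the", 1_000_000L), new Clause("lucene", 1_200L));
    // sorted[0] plays the role of `lead`: the clause expected to match the fewest docs.
    System.out.println("lead = " + sorted[0].term);
  }
}
```

Under that assumption, "least costly" and "matches the fewest docs" coincide only as far as the cost estimate is accurate.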



##########
lucene/CHANGES.txt:
##########
@@ -173,6 +173,9 @@ Optimizations
 * GITHUB#12361: Faster top-level disjunctions sorted by descending score.
   (Adrien Grand)
 
+* GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
+  descending score. (Adrien Grand)
+

Review Comment:
   9.8? Or 9.9?



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.search.Weight.DefaultBulkScorer;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.MathUtil;
+
+/**
+ * BulkScorer implementation of {@link BlockMaxConjunctionScorer} that focuses on top-level
+ * conjunctions over clauses that do not have two-phase iterators. Use a {@link DefaultBulkScorer}
+ * around a {@link BlockMaxConjunctionScorer} if you need two-phase support.
+ * Another difference with {@link BlockMaxConjunctionScorer} is that this scorer computes scores on
+ * the fly in order to be able to skip evaluating more clauses if the total score would be under the
+ * minimum competitive score anyway. This generally works well because computing a score is cheaper
+ * than
+ */
+final class BlockMaxConjunctionBulkScorer extends BulkScorer {
+
+  private final Scorer[] scorers;
+  private final DocIdSetIterator[] iterators;
+  private final DocIdSetIterator lead;
+  private final DocAndScore scorable = new DocAndScore();
+  private final double[] sumOfOtherClauses;
+
+  BlockMaxConjunctionBulkScorer(List<Scorer> scorers) throws IOException {
+    if (scorers.size() <= 1) {
+      throw new IllegalArgumentException("Expected 2 or more scorers, got " + scorers.size());
+    }
+    this.scorers = scorers.toArray(Scorer[]::new);
+    Arrays.sort(this.scorers, Comparator.comparingLong(scorer -> scorer.iterator().cost()));
+    this.iterators =
+        Arrays.stream(this.scorers).map(Scorer::iterator).toArray(DocIdSetIterator[]::new);
+    lead = iterators[0];
+    this.sumOfOtherClauses = new double[this.scorers.length];
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    collector.setScorer(scorable);
+
+    int windowMin = Math.max(lead.docID(), min);
+    while (windowMin < max) {
+      // Use impacts of the least costly scorer to compute windows
+      // NOTE: windowMax is inclusive
+      int windowMax = Math.min(scorers[0].advanceShallow(windowMin), max - 1);
+      for (int i = 1; i < scorers.length; ++i) {
+        scorers[i].advanceShallow(windowMin);
+      }
+
+      for (int i = 0; i < scorers.length; ++i) {
+        sumOfOtherClauses[i] = scorers[i].getMaxScore(windowMax);
+      }
+      double maxWindowScore = 0;
+      for (double maxScore : sumOfOtherClauses) {
+        maxWindowScore += maxScore;
+      }
+      for (int i = sumOfOtherClauses.length - 2; i >= 0; --i) {
+        sumOfOtherClauses[i] += sumOfOtherClauses[i+1];
+      }
+      scoreWindow(collector, acceptDocs, windowMin, windowMax + 1, (float) maxWindowScore);
+      windowMin = Math.max(lead.docID(), windowMax + 1);
+    }
+
+    return windowMin;
+  }
+
+  private void scoreWindow(
+      LeafCollector collector, Bits acceptDocs, int min, int max, float maxWindowScore)
+      throws IOException {
+    if (maxWindowScore < scorable.minCompetitiveScore) {
+      // no hits are competitive
+      return;
+    }
+
+    if (lead.docID() < min) {
+      lead.advance(min);
+    }
+    advanceHead:
+    for (int doc = lead.docID(); doc < max; ) {
+      if (acceptDocs != null && acceptDocs.get(doc) == false) {
+        doc = lead.nextDoc();
+        continue;
+      }
+
+      // Compute the score as we find more matching clauses, in order to skip advancing other clauses if the total score has no chance of being competitive. This works well because computing a score is usually cheaper than decoding a full block of postings and frequencies.
+      final boolean hasMinCompetitiveScore = scorable.minCompetitiveScore > 0;
+      double currentScore;
+      if (hasMinCompetitiveScore) {
+        currentScore = scorers[0].score();
+      } else {
+        currentScore = 0;
+      }
+
+      for (int i = 1; i < iterators.length; ++i) {
+        // First check if we have a chance of having a match
+        if (hasMinCompetitiveScore && MathUtil.sumUpperBound(currentScore + sumOfOtherClauses[i], scorers.length) < scorable.minCompetitiveScore) {
+          doc = lead.nextDoc();
+          continue advanceHead;
+        }
+
+        // NOTE: these iterators may already be on `doc` already if we called `continue advanceHead`
+        // on the previous loop iteration.
+        if (iterators[i].docID() < doc) {
+          int next = iterators[i].advance(doc);
+          if (next != doc) {
+            doc = lead.advance(next);
+            continue advanceHead;
+          }
+        }
+        assert iterators[i].docID() == doc;
+        if (hasMinCompetitiveScore) {
+          currentScore += scorers[i].score();
+        }
+      }
+
+      if (hasMinCompetitiveScore == false) {
+        for (Scorer scorer : scorers) {
+          currentScore += scorer.score();
+        }
+      }
+      scorable.score = (float) currentScore;
+      collector.collect(doc);
+      // The collect() call may have updated the minimum competitive score.
+      if (maxWindowScore < scorable.minCompetitiveScore) {
+        // no more hits are competitive
+        return;
+      }
+
+      doc = lead.nextDoc();
+    }
+  }
+
+  @Override
+  public long cost() {
+    return lead.cost();
+  }
+
+  private class DocAndScore extends Scorable {

Review Comment:
   `static` ?
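
A minimal sketch of what the suggestion amounts to, assuming `DocAndScore` does not read any field of the enclosing scorer (the rest of the class is not shown in this hunk). `StaticNestedSketch` and the simplified `DocAndScore` below are hypothetical and do not extend Lucene's `Scorable`.

```java
final class StaticNestedSketch {
  // Hypothetical, simplified version of the PR's DocAndScore. It touches no
  // state of the enclosing class, so it can be declared static.
  static final class DocAndScore {
    float score;
    float minCompetitiveScore;

    void setMinCompetitiveScore(float minScore) {
      this.minCompetitiveScore = minScore;
    }
  }

  private final DocAndScore scorable = new DocAndScore();

  float minCompetitiveScore() {
    return scorable.minCompetitiveScore;
  }

  public static void main(String[] args) {
    // A static nested class can be created without an enclosing instance.
    DocAndScore standalone = new DocAndScore();
    standalone.setMinCompetitiveScore(1.5f);
    System.out.println(standalone.minCompetitiveScore);
  }
}
```

Declaring the nested class `static` makes it explicit that it does not depend on the enclosing instance.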



##########
lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java:
##########
@@ -250,31 +254,108 @@ BulkScorer optionalBulkScorer(LeafReaderContext context) throws IOException {
         this, optional, Math.max(1, query.getMinimumNumberShouldMatch()), scoreMode.needsScores());
   }
 
-  // Return a BulkScorer for the required clauses only,
-  // or null if it is not applicable
+  // Return a BulkScorer for the required clauses only
   private BulkScorer requiredBulkScorer(LeafReaderContext context) throws IOException {
-    BulkScorer scorer = null;
+    // Is there a single required clause by any chance? Then pull its bulk scorer.
+    Optional<WeightedBooleanClause> singleRequiredClause = null;

Review Comment:
   Following the triple check of "optional & null" is doing my head in.
   
   Could we make this a list and append required clauses?
   
   Then the checks below become:
   
   `requiredWeightedClauses.isEmpty()` (return null)
   `requiredWeightedClauses.size() == 1` (fall into the else if)
   
   Then the loop below would be
   `for (WeightedBooleanClause wc : requiredWeightedClauses) {`
   
   and you can skip the checks for each clause being required.
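
A rough sketch of the list-based shape suggested here, under the assumption that collecting the required clauses up front is all the later branches need. `RequiredClausesSketch`, `WeightedClause`, and `choosePath` are hypothetical stand-ins. The real method would return a `BulkScorer`, while this one only returns a label for the branch taken.

```java
import java.util.ArrayList;
import java.util.List;

final class RequiredClausesSketch {
  // Hypothetical stand-in for WeightedBooleanClause.
  static final class WeightedClause {
    final String label;
    final boolean required;

    WeightedClause(String label, boolean required) {
      this.label = label;
      this.required = required;
    }
  }

  // Describes which branch would run; returns null when nothing is required.
  static String choosePath(List<WeightedClause> clauses) {
    List<WeightedClause> required = new ArrayList<>();
    for (WeightedClause wc : clauses) {
      if (wc.required) {
        required.add(wc);
      }
    }
    if (required.isEmpty()) {
      return null;
    } else if (required.size() == 1) {
      return "single:" + required.get(0).label;
    }
    // From here on, no per-clause "is it required?" check is needed.
    StringBuilder sb = new StringBuilder("conjunction:");
    for (WeightedClause wc : required) {
      sb.append(wc.label).append(' ');
    }
    return sb.toString().trim();
  }
}
```

With a list, the three cases (none, one, several) are distinguished by `isEmpty()` and `size()` alone, which removes the need to reason about an `Optional` that may itself be null.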
   



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.search.Weight.DefaultBulkScorer;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.MathUtil;
+
+/**
+ * BulkScorer implementation of {@link BlockMaxConjunctionScorer} that focuses on top-level
+ * conjunctions over clauses that do not have two-phase iterators. Use a {@link DefaultBulkScorer}
+ * around a {@link BlockMaxConjunctionScorer} if you need two-phase support.
+ * Another difference with {@link BlockMaxConjunctionScorer} is that this scorer computes scores on
+ * the fly in order to be able to skip evaluating more clauses if the total score would be under the
+ * minimum competitive score anyway. This generally works well because computing a score is cheaper
+ * than

Review Comment:
   than what?!?!? 
   
   I am held in suspense! 



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.search.Weight.DefaultBulkScorer;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.MathUtil;
+
+/**
+ * BulkScorer implementation of {@link BlockMaxConjunctionScorer} that focuses on top-level
+ * conjunctions over clauses that do not have two-phase iterators. Use a {@link DefaultBulkScorer}
+ * around a {@link BlockMaxConjunctionScorer} if you need two-phase support.
+ * Another difference with {@link BlockMaxConjunctionScorer} is that this scorer computes scores on
+ * the fly in order to be able to skip evaluating more clauses if the total score would be under the
+ * minimum competitive score anyway. This generally works well because computing a score is cheaper
+ * than
+ */
+final class BlockMaxConjunctionBulkScorer extends BulkScorer {
+
+  private final Scorer[] scorers;
+  private final DocIdSetIterator[] iterators;
+  private final DocIdSetIterator lead;
+  private final DocAndScore scorable = new DocAndScore();
+  private final double[] sumOfOtherClauses;
+
+  BlockMaxConjunctionBulkScorer(List<Scorer> scorers) throws IOException {
+    if (scorers.size() <= 1) {
+      throw new IllegalArgumentException("Expected 2 or more scorers, got " + scorers.size());
+    }
+    this.scorers = scorers.toArray(Scorer[]::new);
+    Arrays.sort(this.scorers, Comparator.comparingLong(scorer -> scorer.iterator().cost()));
+    this.iterators =
+        Arrays.stream(this.scorers).map(Scorer::iterator).toArray(DocIdSetIterator[]::new);
+    lead = iterators[0];
+    this.sumOfOtherClauses = new double[this.scorers.length];
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    collector.setScorer(scorable);
+
+    int windowMin = Math.max(lead.docID(), min);
+    while (windowMin < max) {
+      // Use impacts of the least costly scorer to compute windows
+      // NOTE: windowMax is inclusive
+      int windowMax = Math.min(scorers[0].advanceShallow(windowMin), max - 1);
+      for (int i = 1; i < scorers.length; ++i) {
+        scorers[i].advanceShallow(windowMin);
+      }
+
+      for (int i = 0; i < scorers.length; ++i) {
+        sumOfOtherClauses[i] = scorers[i].getMaxScore(windowMax);
+      }
+      double maxWindowScore = 0;
+      for (double maxScore : sumOfOtherClauses) {
+        maxWindowScore += maxScore;
+      }

Review Comment:
   Seems like this could be done in the same loop, since `sumOfOtherClauses.length == scorers.length`. Why are they separate?
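
A minimal sketch of what merging the two loops could look like, using plain `double[]` arrays in place of the scorers. `maxScores` stands in for the `scorers[i].getMaxScore(windowMax)` values, and `WindowScoreSketch` and `fillSuffixSums` are hypothetical names. The suffix-sum pass from the diff is kept as is.

```java
final class WindowScoreSketch {
  // Copies the per-clause max scores into sumOfOtherClauses, accumulates the
  // window total in the same pass, then turns the array into suffix sums.
  static double fillSuffixSums(double[] maxScores, double[] sumOfOtherClauses) {
    double maxWindowScore = 0;
    for (int i = 0; i < maxScores.length; ++i) {
      sumOfOtherClauses[i] = maxScores[i];
      maxWindowScore += maxScores[i]; // same forward summation order as the separate loop
    }
    for (int i = sumOfOtherClauses.length - 2; i >= 0; --i) {
      sumOfOtherClauses[i] += sumOfOtherClauses[i + 1];
    }
    return maxWindowScore;
  }

  public static void main(String[] args) {
    double[] sums = new double[3];
    double total = fillSuffixSums(new double[] {1.0, 2.0, 4.0}, sums);
    System.out.println(total + " " + java.util.Arrays.toString(sums));
  }
}
```

After the suffix-sum pass, `sumOfOtherClauses[0]` holds the same total, but summed in reverse order, so it may differ from `maxWindowScore` by floating-point rounding. That is one possible reason to keep a separate accumulator.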



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


