[GitHub] [lucene] zacharymorn commented on a change in pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery

GitBox Mon, 29 Nov 2021 23:46:25 -0800


zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r759006970




##########
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }
 
+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip 
entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block 
boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new 
HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < 
tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
       I re-ran the perf tests `combinedFieldsUnevenlyWeightedBig` after bug 
fix 
https://github.com/apache/lucene/pull/418/commits/3d0a215d0174f13b508560bd476cd68475f05864,
 and the results look similar:
   
   With commit `test leadImpact from highest weighted field` 
https://github.com/apache/lucene/pull/418/commits/db2446f0337e5860bb8c731df1f6c16281efb6bb
   
   Run 1:
   ```
            TaskQPS baseline      StdDevQPS my_modified_version      StdDev     
           Pct diff p-value
        CFQHighHigh        2.81     (10.9%)        2.40     (10.3%)  -14.6% ( 
-32% -    7%) 0.000
         CFQHighMed        3.27     (11.8%)        2.91     (11.0%)  -11.2% ( 
-30% -   13%) 0.002
         CFQHighLow       16.09     (14.6%)       14.58     (11.9%)   -9.4% ( 
-31% -   20%) 0.025
        PKLookup       92.08     (11.6%)       96.23     (13.3%)    4.5% ( -18% 
-   33%) 0.255
   ```
   
   Run 2:
   ```
            TaskQPS baseline      StdDevQPS my_modified_version      StdDev     
           Pct diff p-value
         CFQHighLow       15.76     (12.0%)       10.19     (12.9%)  -35.3% ( 
-53% -  -11%) 0.000
         CFQHighMed       12.01     (13.3%)        8.79     (13.9%)  -26.9% ( 
-47% -    0%) 0.000
        CFQHighHigh        5.59     (10.8%)        4.40      (8.4%)  -21.2% ( 
-36% -   -2%) 0.000
        PKLookup       93.36     (21.9%)       94.29     (22.6%)    1.0% ( -35% 
-   58%) 0.888
   ```
   
   ---
   
   Without commit `test leadImpact from highest weighted field` 
https://github.com/apache/lucene/pull/418/commits/db2446f0337e5860bb8c731df1f6c16281efb6bb
   
   Run 1:
   ```
               TaskQPS baseline      StdDevQPS my_modified_version      StdDev  
              Pct diff p-value 
         CFQHighLow       16.69      (7.9%)       15.02     (14.1%)  -10.0% ( 
-29% -   13%) 0.006
           PKLookup       94.38      (9.7%)      105.59     (13.3%)   11.9% ( 
-10% -   38%) 0.001
        CFQHighHigh        5.51      (7.0%)        7.59     (13.5%)   37.6% (  
15% -   62%) 0.000
         CFQHighMed       10.24      (7.2%)       15.23     (20.0%)   48.8% (  
20% -   81%) 0.000
   ```
   
   Run 2:
   ```
               TaskQPS baseline      StdDevQPS my_modified_version      StdDev  
              Pct diff p-value 
           PKLookup       94.85     (11.6%)       92.29     (11.0%)   -2.7% ( 
-22% -   22%) 0.448
        CFQHighHigh        5.26     (13.7%)        6.25     (23.4%)   18.9% ( 
-16% -   64%) 0.002
         CFQHighMed        3.39     (16.1%)        4.55     (29.7%)   34.3% (  
-9% -   95%) 0.000
         CFQHighLow       16.04     (14.9%)       29.22     (39.6%)   82.2% (  
24% -  160%) 0.000
   ```
   
   Run 3:
   ```
               TaskQPS baseline      StdDevQPS my_modified_version      StdDev  
              Pct diff p-value 
           PKLookup       68.96     (40.4%)       67.26     (40.1%)   -2.5% ( 
-59% -  130%) 0.847
        CFQHighHigh        4.85     (16.7%)        5.62     (33.8%)   15.7% ( 
-29% -   79%) 0.062
         CFQHighMed       10.67     (20.1%)       12.47     (42.7%)   16.9% ( 
-38% -   99%) 0.110
         CFQHighLow       14.07     (22.6%)       23.18     (68.7%)   64.8% ( 
-21% -  201%) 0.000
   ```
   
   I guess the scoring function has taken care of / bounded the erroneous norm 
used in approximation earlier as it was effectively `Long.MIN_VALUE`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] zacharymorn commented on a change in pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery

Reply via email to