zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r752930339



##########
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }
 
+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field.
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
       Hi @jpountz, I just noticed a spot in the implementation that gets called multiple times, and so tried to "cache" the result to avoid the repeated computation in https://github.com/apache/lucene/pull/418/commits/6d5e7808fd847d5849e7c4ff3b552aeb9c196bbb. The improvement turned out to be quite noticeable:
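
   I haven't reproduced the exact change from that commit here, but as a hypothetical sketch of the pattern (class and field names are illustrative, and flat int arrays stand in for Lucene's `Impacts` objects), the idea is to lazily memoize the per-level minimum so repeated `getDocIdUpTo(level)` calls skip the rescan over fields:

   ```java
   import java.util.Arrays;

   // Hypothetical sketch: cache the per-level minimum of docIdUpTo across
   // fields, so that repeated getDocIdUpTo(level) calls for the same block
   // don't rescan every field's impacts.
   class CachedBoundaries {
     private final int[][] perFieldDocIdUpTo; // [field][level], stand-in for Impacts
     private final int[] cache;               // lazily computed, -1 = not computed yet

     CachedBoundaries(int[][] perFieldDocIdUpTo, int maxLevels) {
       this.perFieldDocIdUpTo = perFieldDocIdUpTo;
       this.cache = new int[maxLevels];
       Arrays.fill(cache, -1);
     }

     int getDocIdUpTo(int level) {
       if (cache[level] != -1) {
         return cache[level]; // cached: skip the scan over fields
       }
       int result = Integer.MAX_VALUE;
       for (int[] levels : perFieldDocIdUpTo) {
         if (levels.length > level) {
           result = Math.min(result, levels[level]);
         }
       }
       cache[level] = result;
       return result;
     }
   }
   ```

   The cache would need to be invalidated whenever the underlying impacts advance to a new block; this sketch only shows the memoization side.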
   
   Results from `python3 src/python/localrun.py -source combinedFieldsBig`
   
   Run 1
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev               Pct diff p-value
         CFQHighHigh        3.17      (3.5%)        2.70      (4.6%)  -14.8% ( -22% -   -6%) 0.000
            PKLookup      117.15      (6.7%)      116.31     (10.3%)   -0.7% ( -16% -   17%) 0.792
          CFQHighMed       12.98      (5.2%)       18.26     (13.9%)   40.7% (  20% -   63%) 0.000
          CFQHighLow       17.30      (3.6%)       36.19     (20.1%)  109.2% (  82% -  137%) 0.000
   ```
   
   Run 2
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev               Pct diff p-value
            PKLookup      104.98      (6.9%)      115.76      (8.5%)   10.3% (  -4% -   27%) 0.000
          CFQHighMed        5.73      (3.9%)        6.53      (8.4%)   14.0% (   1% -   27%) 0.000
          CFQHighLow       17.87      (3.4%)       20.37      (8.1%)   14.0% (   2% -   26%) 0.000
         CFQHighHigh        5.41      (4.2%)        6.53      (8.9%)   20.7% (   7% -   35%) 0.000
   ```
   
   Run 3
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev               Pct diff p-value
         CFQHighHigh        3.18      (3.6%)        2.66      (5.3%)  -16.4% ( -24% -   -7%) 0.000
          CFQHighMed        5.75      (4.0%)        5.07      (5.5%)  -11.7% ( -20% -   -2%) 0.000
            PKLookup      136.87      (5.9%)      133.97      (9.4%)   -2.1% ( -16% -   14%) 0.393
          CFQHighLow       23.19      (2.9%)       38.74     (16.0%)   67.0% (  46% -   88%) 0.000
   ```
   
   Results from `python3 src/python/localrun.py -source combinedFieldsUnevenlyWeightedBig`
   
   Run 1
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev               Pct diff p-value
            PKLookup      102.30      (8.5%)      101.05     (15.1%)   -1.2% ( -22% -   24%) 0.753
          CFQHighMed        5.73      (6.7%)        6.61     (12.5%)   15.5% (  -3% -   37%) 0.000
         CFQHighHigh        5.37      (6.2%)        7.28     (12.7%)   35.6% (  15% -   58%) 0.000
          CFQHighLow       13.61      (6.8%)       27.02     (34.4%)   98.5% (  53% -  150%) 0.000
   ```
   
   Run 2
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev               Pct diff p-value
          CFQHighMed        5.63      (4.0%)        4.87      (6.6%)  -13.5% ( -23% -   -2%) 0.000
            PKLookup       95.27      (9.3%)      102.84      (9.0%)    7.9% (  -9% -   28%) 0.006
         CFQHighHigh        5.45      (3.7%)        7.43     (10.6%)   36.3% (  21% -   52%) 0.000
          CFQHighLow       13.91      (4.6%)       29.42     (28.2%)  111.5% (  75% -  151%) 0.000
   ```
   
   Run 3
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev               Pct diff p-value
            PKLookup       74.67     (39.5%)       97.84     (23.3%)   31.0% ( -22% -  155%) 0.002
         CFQHighHigh        5.61      (4.5%)        7.62     (10.0%)   35.8% (  20% -   52%) 0.000
          CFQHighMed       10.44      (3.5%)       15.65     (23.8%)   50.0% (  21% -   80%) 0.000
          CFQHighLow       14.00      (4.2%)       25.99     (33.0%)   85.6% (  46% -  128%) 0.000
   ```
   
   However, this optimization doesn't help much for the query `CFQHighHigh: at united +combinedFields=titleTokenized^4.0,body^2.0 # freq=2834104 freq=1185528`, as the dense skip list for `at` still causes frequent max score computation (although giving the field `titleTokenized` more weight does reduce the performance hit from -42.0% to -34.3%). I have tried playing with some different `numLevels` and `getDocIdUpTo` implementations, but so far I find it hard to balance these two effects:
   1. lower docIdUpTo -> lower max score -> more frequent max score computation, but also more effective pruning due to the lower max score
   2. higher docIdUpTo -> higher max score -> less frequent max score computation, but also less effective pruning due to the higher max score
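
   For what it's worth, the frequency side of that trade-off can be seen in a toy model (entirely illustrative, not using the Lucene APIs): with a fixed block size, every boundary crossing forces a max score recomputation, so smaller blocks give tighter bounds at the cost of more recomputations:

   ```java
   // Toy model of the docIdUpTo trade-off: count how often the max score
   // must be recomputed while advancing through maxDoc documents, given a
   // fixed block size. Smaller blocks -> lower (tighter) score bounds but
   // more boundary crossings, hence more recomputations.
   class BoundaryCost {
     static int maxScoreRecomputations(int maxDoc, int blockSize) {
       int recomputations = 0;
       int boundary = blockSize - 1; // docIdUpTo of the current block
       for (int doc = 0; doc < maxDoc; ++doc) {
         if (doc > boundary) { // crossed into the next block
           recomputations++;   // a new, possibly lower, max score is needed
           boundary += blockSize;
         }
       }
       return recomputations;
     }
   }
   ```

   With `maxDoc = 1000`, a block size of 10 triggers roughly ten times as many recomputations as a block size of 100; whether the tighter bounds pay for that extra work depends on how much pruning they buy, which is exactly the balance described above.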




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


