zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r754091056



##########
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }
 
+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
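+      /** Wraps one term's iterator of impacts, tracking the current impact and the previously consumed freq. */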
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
       Thanks @jpountz for the confirmation and clarification! I gave these 
ideas a try in this commit 
https://github.com/apache/lucene/pull/418/commits/db2446f0337e5860bb8c731df1f6c16281efb6bb,
 with the following two similar approaches:
   1. Use the impacts of a) the highest-weight field + b) the lowest-doc-freq term within that field (commented out in the commit)
   2. Use the impacts of a) the highest-weight field + b) the term with the lowest `getDocIdUpTo` within that field (a sketch of this selection follows the list)
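
   For reference, here is a minimal sketch of the selection used in approach 2. It is illustrative only (not the exact code in the commit) and assumes the `fieldWeights` and `fieldsWithImpactsEnums` maps that are passed into `mergeImpacts`:
   ```java
   // Pick the field with the highest weight as the lead field.
   String leadField = null;
   float maxWeight = Float.NEGATIVE_INFINITY;
   for (Map.Entry<String, Float> entry : fieldWeights.entrySet()) {
     if (entry.getValue() > maxWeight) {
       maxWeight = entry.getValue();
       leadField = entry.getKey();
     }
   }

   // Within that field, lead with the term whose level-0 boundary (getDocIdUpTo(0)) is lowest.
   Impacts leadImpacts = null;
   for (ImpactsEnum impactsEnum : fieldsWithImpactsEnums.get(leadField)) {
     Impacts impacts = impactsEnum.getImpacts(); // may throw IOException, like getImpacts() above
     if (leadImpacts == null || impacts.getDocIdUpTo(0) < leadImpacts.getDocIdUpTo(0)) {
       leadImpacts = impacts;
     }
   }
   // leadImpacts would then drive numLevels() / getDocIdUpTo(level) instead of the per-field map.
   ```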
   
   Here are the results from `python3 src/python/localrun.py -source 
combinedFieldsUnevenlyWeightedBig`
   
   Run 1:
   ```
              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
        CFQHighMed        5.46      (7.7%)        4.58      (7.1%)  -16.1% ( -28% -   -1%) 0.000
        CFQHighLow       21.83      (7.7%)       18.86      (7.2%)  -13.6% ( -26% -    1%) 0.000
       CFQHighHigh        3.44      (7.4%)        3.06      (7.4%)  -11.1% ( -24% -    4%) 0.000
          PKLookup      109.63     (11.8%)      109.67     (13.5%)    0.0% ( -22% -   28%) 0.993
   ```
   
   Run 2: 
   ```
              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
        CFQHighLow       17.22     (10.2%)       11.73     (12.0%)  -31.9% ( -49% -  -10%) 0.000
        CFQHighMed       10.20     (10.5%)        8.36      (9.3%)  -18.1% ( -34% -    1%) 0.000
       CFQHighHigh        3.65      (6.6%)        3.11      (5.3%)  -14.8% ( -25% -   -3%) 0.000
          PKLookup       91.80     (26.0%)      100.13     (16.3%)    9.1% ( -26% -   69%) 0.187
   ```
   
   Run 3:
   ```
              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
        CFQHighMed        5.62      (8.1%)        4.86      (8.0%)  -13.5% ( -27% -    2%) 0.000
       CFQHighHigh        4.13      (6.9%)        3.59      (5.8%)  -13.0% ( -24% -    0%) 0.000
        CFQHighLow       17.00      (8.2%)       15.06      (9.7%)  -11.4% ( -27% -    7%) 0.000
          PKLookup       99.61      (6.4%)      102.96      (6.7%)    3.4% (  -9% -   17%) 0.104
   ```
   
   Their CPU profiles also look very similar to the baseline's, suggesting that pruning was not very effective there.
   
   Interestingly, when I flipped the weights assigned to the `body` and `titleTokenized` fields in the task (20.0 to `body` and 2.0 to `titleTokenized`), I got the following results:
   
   Run 1:
   ```
              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
          PKLookup       99.64      (9.1%)       94.80     (11.6%)   -4.9% ( -23% -   17%) 0.142
        CFQHighLow       17.31      (7.2%)       19.16     (15.0%)   10.7% ( -10% -   35%) 0.004
        CFQHighMed        3.41      (6.4%)        4.57     (13.6%)   34.2% (  13% -   57%) 0.000
       CFQHighHigh        3.28      (6.3%)        5.32     (16.8%)   62.2% (  36% -   91%) 0.000
   ```
   
   Run 2:
   ```
              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
       CFQHighHigh        4.10      (6.3%)        2.72      (5.8%)  -33.6% ( -42% -  -23%) 0.000
        CFQHighMed        5.69      (5.6%)        4.48      (8.2%)  -21.4% ( -33% -   -8%) 0.000
          PKLookup      104.53     (21.1%)      104.06     (16.8%)   -0.5% ( -31% -   47%) 0.940
        CFQHighLow       22.52      (7.4%)       35.31     (21.3%)   56.8% (  26% -   92%) 0.000
   ```
   
   Run 3:
   ```
              TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
          PKLookup       80.39     (31.2%)       93.55     (19.4%)   16.4% ( -26% -   97%) 0.046
       CFQHighHigh        5.71      (6.6%)        6.96     (13.4%)   21.7% (   1% -   44%) 0.000
        CFQHighMed       10.21      (5.6%)       12.62     (22.0%)   23.6% (  -3% -   54%) 0.000
        CFQHighLow       16.89      (4.2%)       27.39     (22.9%)   62.1% (  33% -   93%) 0.000
   ```
   
   I think what might have happened here is: the highest-weighted field / lowest-doc-freq term tends to have fewer terms per doc, so by definition its matches are also sparser in the corpus than those of the lower-weighted field (i.e. more docs contain a given term in their `body` field than in their `titleTokenized` field). Sparser postings mean each skip block spans a wider range of doc ids, so when `getDocIdUpTo` is taken from the highest-weighted field alone, a larger bound is returned (probably close to `Math.max` across fields?), causing the max score to be much higher and pruning to be less effective.
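
   To make that concrete with made-up numbers (the boundaries below are hypothetical, not measured from the benchmark):
   ```java
   // Hypothetical level-0 block boundaries, for illustration only.
   int titleDocIdUpTo = 50_000; // high-weight, sparse field: one block of postings spans many doc ids
   int bodyDocIdUpTo = 1_000;   // low-weight, dense field: one block of postings spans few doc ids

   // Leading with the high-weight field alone returns the wide boundary, so max scores
   // apply to a much larger doc id window and prune less:
   int boundFromLeadField = titleDocIdUpTo;                    // 50_000

   // Taking the minimum across fields (as getDocIdUpTo(level) above does) keeps the window tight:
   int boundFromMin = Math.min(titleDocIdUpTo, bodyDocIdUpTo); // 1_000
   ```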
   
   What do you think of the commit and results? Is there anything I may have 
missed?
   






