zacharymorn commented on a change in pull request #418: URL: https://github.com/apache/lucene/pull/418#discussion_r759006970
########## File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java ########## @@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) { } } + /** Merge impacts for combined field. */ + static ImpactsSource mergeImpacts( + Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums, + Map<String, List<Impacts>> fieldsWithImpacts, + Map<String, Float> fieldWeights) { + return new ImpactsSource() { + + class SubIterator { + final Iterator<Impact> iterator; + int previousFreq; + Impact current; + + SubIterator(Iterator<Impact> iterator) { + this.iterator = iterator; + this.current = iterator.next(); + } + + void next() { + previousFreq = current.freq; + if (iterator.hasNext() == false) { + current = null; + } else { + current = iterator.next(); + } + } + } + + @Override + public Impacts getImpacts() throws IOException { + // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for + // each field + // They collectively will decide on the number of levels and the block boundaries. + Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size()); + + for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts : + fieldsWithImpactsEnums.entrySet()) { + String field = fieldImpacts.getKey(); + List<ImpactsEnum> impactsEnums = fieldImpacts.getValue(); + fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size())); + + Impacts tmpLead = null; + // find the impact that has the lowest next boundary for this field + for (int i = 0; i < impactsEnums.size(); ++i) { + Impacts impacts = impactsEnums.get(i).getImpacts(); + fieldsWithImpacts.get(field).add(impacts); + + if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) { + tmpLead = impacts; + } + } + + leadingImpactsPerField.put(field, tmpLead); + } + + return new Impacts() { + + @Override + public int numLevels() { + // max of levels across fields' impactEnums + int result = 0; + + for (Impacts impacts : leadingImpactsPerField.values()) { + result = Math.max(result, impacts.numLevels()); + } + + return result; + } + + @Override + public int getDocIdUpTo(int level) { + // min of docIdUpTo across fields' impactEnums + int result = Integer.MAX_VALUE; + + for (Impacts impacts : leadingImpactsPerField.values()) { + if (impacts.numLevels() > level) { + result = Math.min(result, impacts.getDocIdUpTo(level)); + } + } + + return result; + } Review comment: I re-ran the perf tests `combinedFieldsUnevenlyWeightedBig` after bug fix https://github.com/apache/lucene/pull/418/commits/3d0a215d0174f13b508560bd476cd68475f05864, and the results look similar: With commit `test leadImpact from highest weighted field` https://github.com/apache/lucene/pull/418/commits/db2446f0337e5860bb8c731df1f6c16281efb6bb Run 1: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value CFQHighHigh 2.81 (10.9%) 2.40 (10.3%) -14.6% ( -32% - 7%) 0.000 CFQHighMed 3.27 (11.8%) 2.91 (11.0%) -11.2% ( -30% - 13%) 0.002 CFQHighLow 16.09 (14.6%) 14.58 (11.9%) -9.4% ( -31% - 20%) 0.025 PKLookup 92.08 (11.6%) 96.23 (13.3%) 4.5% ( -18% - 33%) 0.255 ``` Run 2: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value CFQHighLow 15.76 (12.0%) 10.19 (12.9%) -35.3% ( -53% - -11%) 0.000 CFQHighMed 12.01 (13.3%) 8.79 (13.9%) -26.9% ( -47% - 0%) 0.000 CFQHighHigh 5.59 (10.8%) 4.40 (8.4%) -21.2% ( -36% - -2%) 0.000 PKLookup 93.36 (21.9%) 94.29 (22.6%) 1.0% ( -35% - 58%) 0.888 ``` --- Without commit `test leadImpact from highest weighted field` https://github.com/apache/lucene/pull/418/commits/db2446f0337e5860bb8c731df1f6c16281efb6bb Run 1: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value CFQHighLow 16.69 (7.9%) 15.02 (14.1%) -10.0% ( -29% - 13%) 0.006 PKLookup 94.38 (9.7%) 105.59 (13.3%) 11.9% ( -10% - 38%) 0.001 CFQHighHigh 5.51 (7.0%) 7.59 (13.5%) 37.6% ( 15% - 62%) 0.000 CFQHighMed 10.24 (7.2%) 15.23 (20.0%) 48.8% ( 20% - 81%) 0.000 ``` Run 2: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value PKLookup 94.85 (11.6%) 92.29 (11.0%) -2.7% ( -22% - 22%) 0.448 CFQHighHigh 5.26 (13.7%) 6.25 (23.4%) 18.9% ( -16% - 64%) 0.002 CFQHighMed 3.39 (16.1%) 4.55 (29.7%) 34.3% ( -9% - 95%) 0.000 CFQHighLow 16.04 (14.9%) 29.22 (39.6%) 82.2% ( 24% - 160%) 0.000 ``` Run 3: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value PKLookup 68.96 (40.4%) 67.26 (40.1%) -2.5% ( -59% - 130%) 0.847 CFQHighHigh 4.85 (16.7%) 5.62 (33.8%) 15.7% ( -29% - 79%) 0.062 CFQHighMed 10.67 (20.1%) 12.47 (42.7%) 16.9% ( -38% - 99%) 0.110 CFQHighLow 14.07 (22.6%) 23.18 (68.7%) 64.8% ( -21% - 201%) 0.000 ``` I guess the scoring function has taken care of / bounded the erroneous norm used in approximation earlier as it was effectively `Long.MIN_VALUE`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org