zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r752930339
##########
File path:
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
}
}
+ /** Merge impacts for combined field. */
+ static ImpactsSource mergeImpacts(
+ Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+ Map<String, List<Impacts>> fieldsWithImpacts,
+ Map<String, Float> fieldWeights) {
+ return new ImpactsSource() {
+
+ class SubIterator {
+ final Iterator<Impact> iterator;
+ int previousFreq;
+ Impact current;
+
+ SubIterator(Iterator<Impact> iterator) {
+ this.iterator = iterator;
+ this.current = iterator.next();
+ }
+
+ void next() {
+ previousFreq = current.freq;
+ if (iterator.hasNext() == false) {
+ current = null;
+ } else {
+ current = iterator.next();
+ }
+ }
+ }
+
+ @Override
+ public Impacts getImpacts() throws IOException {
+ // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+ // each field.
+ // They collectively will decide on the number of levels and the block boundaries.
+ Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+ for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts : fieldsWithImpactsEnums.entrySet()) {
+ String field = fieldImpacts.getKey();
+ List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+ fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+ Impacts tmpLead = null;
+ // find the impact that has the lowest next boundary for this field
+ for (int i = 0; i < impactsEnums.size(); ++i) {
+ Impacts impacts = impactsEnums.get(i).getImpacts();
+ fieldsWithImpacts.get(field).add(impacts);
+
+ if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+ tmpLead = impacts;
+ }
+ }
+
+ leadingImpactsPerField.put(field, tmpLead);
+ }
+
+ return new Impacts() {
+
+ @Override
+ public int numLevels() {
+ // max of levels across fields' impactEnums
+ int result = 0;
+
+ for (Impacts impacts : leadingImpactsPerField.values()) {
+ result = Math.max(result, impacts.numLevels());
+ }
+
+ return result;
+ }
+
+ @Override
+ public int getDocIdUpTo(int level) {
+ // min of docIdUpTo across fields' impactEnums
+ int result = Integer.MAX_VALUE;
+
+ for (Impacts impacts : leadingImpactsPerField.values()) {
+ if (impacts.numLevels() > level) {
+ result = Math.min(result, impacts.getDocIdUpTo(level));
+ }
+ }
+
+ return result;
+ }
Review comment:
Hi @jpountz , I just noticed a place in the implementation that was invoked
multiple times, so I "cached" the result to avoid the repeated computation:
https://github.com/apache/lucene/pull/418/commits/6d5e7808fd847d5849e7c4ff3b552aeb9c196bbb.
The improvement turned out to be quite noticeable:
Results from `python3 src/python/localrun.py -source combinedFieldsBig`
Run 1
```
       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
CFQHighHigh        3.17      (3.5%)        2.70      (4.6%)  -14.8% ( -22% -   -6%) 0.000
   PKLookup      117.15      (6.7%)      116.31     (10.3%)   -0.7% ( -16% -   17%) 0.792
 CFQHighMed       12.98      (5.2%)       18.26     (13.9%)   40.7% (  20% -   63%) 0.000
 CFQHighLow       17.30      (3.6%)       36.19     (20.1%)  109.2% (  82% -  137%) 0.000
```
Run 2
```
       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
   PKLookup      104.98      (6.9%)      115.76      (8.5%)   10.3% (  -4% -   27%) 0.000
 CFQHighMed        5.73      (3.9%)        6.53      (8.4%)   14.0% (   1% -   27%) 0.000
 CFQHighLow       17.87      (3.4%)       20.37      (8.1%)   14.0% (   2% -   26%) 0.000
CFQHighHigh        5.41      (4.2%)        6.53      (8.9%)   20.7% (   7% -   35%) 0.000
```
Run 3
```
       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
CFQHighHigh        3.18      (3.6%)        2.66      (5.3%)  -16.4% ( -24% -   -7%) 0.000
 CFQHighMed        5.75      (4.0%)        5.07      (5.5%)  -11.7% ( -20% -   -2%) 0.000
   PKLookup      136.87      (5.9%)      133.97      (9.4%)   -2.1% ( -16% -   14%) 0.393
 CFQHighLow       23.19      (2.9%)       38.74     (16.0%)   67.0% (  46% -   88%) 0.000
```
Results from `python3 src/python/localrun.py -source combinedFieldsUnevenlyWeightedBig`
Run 1
```
       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
   PKLookup      102.30      (8.5%)      101.05     (15.1%)   -1.2% ( -22% -   24%) 0.753
 CFQHighMed        5.73      (6.7%)        6.61     (12.5%)   15.5% (  -3% -   37%) 0.000
CFQHighHigh        5.37      (6.2%)        7.28     (12.7%)   35.6% (  15% -   58%) 0.000
 CFQHighLow       13.61      (6.8%)       27.02     (34.4%)   98.5% (  53% -  150%) 0.000
```
Run 2
```
       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
 CFQHighMed        5.63      (4.0%)        4.87      (6.6%)  -13.5% ( -23% -   -2%) 0.000
   PKLookup       95.27      (9.3%)      102.84      (9.0%)    7.9% (  -9% -   28%) 0.006
CFQHighHigh        5.45      (3.7%)        7.43     (10.6%)   36.3% (  21% -   52%) 0.000
 CFQHighLow       13.91      (4.6%)       29.42     (28.2%)  111.5% (  75% -  151%) 0.000
```
Run 3
```
       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
   PKLookup       74.67     (39.5%)       97.84     (23.3%)   31.0% ( -22% -  155%) 0.002
CFQHighHigh        5.61      (4.5%)        7.62     (10.0%)   35.8% (  20% -   52%) 0.000
 CFQHighMed       10.44      (3.5%)       15.65     (23.8%)   50.0% (  21% -   80%) 0.000
 CFQHighLow       14.00      (4.2%)       25.99     (33.0%)   85.6% (  46% -  128%) 0.000
```
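The caching change linked above boils down to memoizing a per-level value that was being recomputed on every call. A rough, self-contained sketch of the idea (hypothetical class and names, not the actual code in the linked commit): compute the expensive per-level value once, reuse it, and invalidate the cache when the underlying enums advance to a new block.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of memoizing a repeated per-level computation
// (e.g. a merged getDocIdUpTo). Not the actual Lucene implementation.
public class CachedBoundaries {
  private final IntUnaryOperator compute; // the expensive per-level computation
  private final int[] cache;

  CachedBoundaries(int numLevels, IntUnaryOperator compute) {
    this.compute = compute;
    this.cache = new int[numLevels];
    Arrays.fill(cache, -1); // -1 marks "not computed yet"
  }

  int getDocIdUpTo(int level) {
    if (cache[level] == -1) {
      cache[level] = compute.applyAsInt(level); // compute once per block
    }
    return cache[level];
  }

  // Would be called when the underlying impacts enums advance past the
  // current block, so stale boundaries are recomputed.
  void invalidate() {
    Arrays.fill(cache, -1);
  }
}
```

The win comes from callers that query the same level many times between block advances, which matches the "called multiple times" hot spot described above.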
However, this optimization doesn't help much for the query `CFQHighHigh: at
united +combinedFields=titleTokenized^4.0,body^2.0 # freq=2834104
freq=1185528`, as the dense skip list for `at` still causes frequent max score
computation (although giving the field `titleTokenized` more weight does
reduce the performance hit from -42.0% to -34.3%). I tried playing with
different `numLevels` and `getDocIdUpTo` implementations, but so far I have
found it hard to balance the following two effects:
1. Lower docIdUpTo -> lower max score -> more frequent max score
computation, but also more effective pruning due to the lower max score.
2. Higher docIdUpTo -> higher max score -> less frequent max score
computation, but also less effective pruning due to the higher max score.
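To make the two merge rules from the snippet concrete: the merged view reports the maximum `numLevels` across fields and, per level, the minimum `getDocIdUpTo`. A minimal, self-contained sketch (the `FieldLevels` record and class name are hypothetical stand-ins, not Lucene's actual `Impacts` API):

```java
import java.util.List;

// Simplified sketch of the merge rules in CombinedFieldQuery's merged
// Impacts: max numLevels across fields, min docIdUpTo per level.
public class MergedImpactsSketch {

  // Hypothetical stand-in for one field's per-level skip boundaries.
  record FieldLevels(int[] docIdUpTo) {
    int numLevels() {
      return docIdUpTo.length;
    }
  }

  static int mergedNumLevels(List<FieldLevels> fields) {
    int result = 0;
    for (FieldLevels f : fields) {
      result = Math.max(result, f.numLevels());
    }
    return result;
  }

  static int mergedDocIdUpTo(List<FieldLevels> fields, int level) {
    int result = Integer.MAX_VALUE;
    for (FieldLevels f : fields) {
      // Only fields that actually have this level constrain the boundary.
      if (f.numLevels() > level) {
        result = Math.min(result, f.docIdUpTo()[level]);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<FieldLevels> fields =
        List.of(
            new FieldLevels(new int[] {128, 1024}),
            new FieldLevels(new int[] {64, 2048, 100000}));
    System.out.println(mergedNumLevels(fields)); // 3: the field with more levels wins
    System.out.println(mergedDocIdUpTo(fields, 0)); // 64: min(128, 64)
    System.out.println(mergedDocIdUpTo(fields, 2)); // 100000: only one field has level 2
  }
}
```

Taking the minimum boundary per level is exactly what drives trade-off 1 above: the field with the densest skip entries (like `at` here) forces early block boundaries on the whole merged view, so max scores get recomputed often.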
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]