Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1825228925
After some tweaking and tinkering I was finally able to get the bench
running the way I wanted in luceneutil! @mikemccand
Unfortunately luceneutil out of the box doesn't work for my case...
```
Task                    QPS baseline  StdDev QPS my_modified_version  StdDev Pct diff              p-value
Wildcard                       56.20  (2.0%)                    7.30  (0.2%)   -87.0% ( -87% - -86%) 0.000
Respell                        44.00  (1.7%)                   14.46  (0.6%)   -67.1% ( -68% - -65%) 0.000
Fuzzy1                         54.11  (1.2%)                   20.86  (0.8%)   -61.4% ( -62% - -60%) 0.000
Prefix3                       103.61  (0.8%)                   41.37  (0.5%)   -60.1% ( -60% - -59%) 0.000
Fuzzy2                         42.43  (1.2%)                   20.26  (0.9%)   -52.3% ( -53% - -50%) 0.000
HighTermTitleSort             127.60  (1.7%)                  114.65  (1.6%)   -10.1% ( -13% -  -6%) 0.000
HighTermMonthSort            2549.22  (2.8%)                 2435.74  (3.8%)    -4.5% ( -10% -   2%) 0.037
AndHighLow                    708.99  (2.3%)                  678.15  (2.1%)    -4.4% (  -8% -   0%) 0.001
LowTerm                       369.16  (5.5%)                  358.15  (3.1%)    -3.0% ( -10% -   5%) 0.287
OrNotHighLow                  258.51  (1.8%)                  252.11  (2.2%)    -2.5% (  -6% -   1%) 0.054
OrHighLow                     348.58  (1.1%)                  340.11  (2.5%)    -2.4% (  -5% -   1%) 0.046
OrNotHighHigh                 141.31  (6.7%)                  138.54  (3.2%)    -2.0% ( -11% -   8%) 0.554
IntNRQ                         17.93  (1.8%)                   17.62  (3.7%)    -1.7% (  -7% -   3%) 0.349
MedSloppyPhrase                26.73  (0.8%)                   26.28  (1.3%)    -1.7% (  -3% -   0%) 0.017
HighIntervalsOrdered            3.88  (2.6%)                    3.82  (2.3%)    -1.6% (  -6% -   3%) 0.314
HighTerm                      290.77  (7.3%)                  286.20  (6.9%)    -1.6% ( -14% -  13%) 0.727
OrHighNotHigh                 296.51  (6.9%)                  291.90  (4.9%)    -1.6% ( -12% -  11%) 0.682
LowIntervalsOrdered            15.48  (1.1%)                   15.26  (1.3%)    -1.4% (  -3% -   0%) 0.057
MedTerm                       405.54  (6.3%)                  399.77  (5.9%)    -1.4% ( -12% -  11%) 0.713
LowSloppyPhrase                44.64  (0.5%)                   44.04  (2.1%)    -1.3% (  -3% -   1%) 0.171
HighTermDayOfYearSort         177.36  (1.6%)                  175.08  (1.8%)    -1.3% (  -4% -   2%) 0.234
OrHighMed                      79.44  (3.7%)                   78.48  (3.0%)    -1.2% (  -7% -   5%) 0.572
OrNotHighMed                  268.82  (4.3%)                  265.74  (3.1%)    -1.1% (  -8% -   6%) 0.632
BrowseMonthTaxoFacets           3.89  (0.4%)                    3.85  (1.2%)    -0.9% (  -2% -   0%) 0.134
AndHighMedDayTaxoFacets        44.33  (0.7%)                   43.96  (1.1%)    -0.8% (  -2% -   0%) 0.145
MedSpanNear                    30.67  (1.1%)                   30.42  (2.3%)    -0.8% (  -4% -   2%) 0.481
LowSpanNear                     4.56  (0.8%)                    4.53  (2.1%)    -0.6% (  -3% -   2%) 0.576
HighSpanNear                    8.52  (1.4%)                    8.47  (2.1%)    -0.5% (  -3% -   2%) 0.636
OrHighNotLow                  236.32  (6.6%)                  235.28  (4.4%)    -0.4% ( -10% -  11%) 0.901
AndHighHighDayTaxoFacets        4.31  (0.6%)                    4.29  (0.8%)    -0.4% (  -1% -   1%) 0.446
LowPhrase                      69.90  (1.2%)                   69.67  (2.7%)    -0.3% (  -4% -   3%) 0.807
AndHighMed                     41.87  (0.6%)                   41.75  (2.7%)    -0.3% (  -3% -   3%) 0.828
OrHighHigh                     45.86  (7.4%)                   45.77  (7.9%)    -0.2% ( -14% -  16%) 0.969
TermDTSort                    101.90  (2.3%)                  101.84  (2.0%)    -0.1% (  -4% -   4%) 0.966
HighSloppyPhrase                0.48  (2.2%)                    0.48  (2.6%)    -0.0% (  -4% -   4%) 0.995
BrowseDateSSDVFacets            1.00  (3.7%)                    1.00  (5.3%)    -0.0% (  -8% -   9%) 1.000
HighTermTitleBDVSort            4.48  (5.4%)                    4.49  (4.4%)     0.0% (  -9% -  10%) 0.990
BrowseDayOfYearSSDVFacets       3.51 (13.4%)                    3.52 (13.3%)     0.1% ( -23% -  30%) 0.990
BrowseMonthSSDVFacets           3.38  (0.7%)                    3.39  (0.6%)     0.1% (  -1% -   1%) 0.740
BrowseDayOfYearTaxoFacets       3.85  (0.3%)                    3.85  (0.5%)     0.2% (   0% -   0%) 0.469
MedIntervalsOrdered             1.65  (1.5%)                    1.65  (1.5%)     0.2% (  -2% -   3%) 0.802
BrowseDateTaxoFacets            3.82  (0.3%)                    3.83  (0.4%)     0.3% (   0% -   0%) 0.264
BrowseRandomLabelSSDVFacets     2.34  (1.0%)                    2.35  (0.5%)     0.3% (  -1% -   1%) 0.494
OrHighNotMed                  296.34  (7.6%)                  297.38  (4.6%)     0.4% ( -11% -  13%) 0.930
BrowseRandomLabelTaxoFacets     3.29  (0.6%)                    3.31  (0.7%)     0.5% (   0% -   1%) 0.275
MedPhrase                       9.84  (2.1%)                    9.90  (4.1%)     0.6% (  -5% -   6%) 0.787
AndHighHigh                    33.50  (1.2%)                   33.73  (3.4%)     0.7% (  -3% -   5%) 0.669
HighPhrase                     15.15  (2.3%)                   15.28  (4.1%)     0.9% (  -5% -   7%) 0.679
MedTermDayTaxoFacets           13.36  (1.8%)                   13.55  (2.0%)     1.4% (  -2% -   5%) 0.240
OrHighMedDayTaxoFacets          3.73  (1.4%)                    3.83  (2.2%)     2.6% (   0% -   6%) 0.026
PKLookup                      147.84  (1.6%)                  157.37  (1.5%)     6.4% (   3% -   9%) 0.000
```
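For reading the table: the `Pct diff` column appears consistent with the plain relative change in QPS between the two runs. The sketch below recomputes it for three rows, assuming that definition; the helper name `pct_diff` is made up for illustration and is not part of luceneutil, and because the printed QPS values are rounded, recomputed figures can be off by a rounding step for other rows.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change of candidate vs. baseline throughput, in percent."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# (task, baseline QPS, modified QPS) copied from the table above.
rows = [
    ("Wildcard", 56.20, 7.30),
    ("HighTermTitleSort", 127.60, 114.65),
    ("PKLookup", 147.84, 157.37),
]

for task, base, cand in rows:
    # Negative values mean the modified version is slower.
    print(f"{task:<20}{pct_diff(base, cand):+.1f}%")
```

Rows whose p-value is well above the usual 0.05 threshold (for example `HighTerm` at 0.727) are best treated as noise rather than as real changes.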
### Observations:
#### PKLookup improved
This is reasonable as the terms index (FST) holds all the terms.
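A toy model of that intuition, not Lucene code: a sorted Python list stands in for the term dictionary, a dict stands in for an index that knows every term, and a list of block-leading terms stands in for an index that only narrows a lookup down to a block. All names (`BLOCK_SIZE`, `lookup_partial_index`, `lookup_full_index`) are hypothetical, and real FST and block layouts are far more involved.

```python
from bisect import bisect_right

terms = sorted(["apache", "apple", "apply", "banana",
                "band", "lucene", "lucid", "zebra"])

BLOCK_SIZE = 4  # hypothetical block size, chosen only for the example
blocks = [terms[i:i + BLOCK_SIZE] for i in range(0, len(terms), BLOCK_SIZE)]
block_first = [block[0] for block in blocks]  # partial index: one entry per block

full_index = set(terms)  # full index: knows every term


def lookup_partial_index(term):
    """Find the candidate block via the index, then scan inside the block.

    Returns (found, number of terms scanned in the block).
    """
    i = bisect_right(block_first, term) - 1
    if i < 0:
        return False, 0  # term sorts before the first block
    scanned = 0
    for candidate in blocks[i]:
        scanned += 1
        if candidate == term:
            return True, scanned
    return False, scanned


def lookup_full_index(term):
    """The index alone answers the lookup; no block is scanned."""
    return term in full_index, 0


print(lookup_partial_index("lucid"))  # hit, after scanning part of a block
print(lookup_partial_index("luck"))   # miss, after scanning the whole block
print(lookup_full_index("luck"))      # miss, decided by the index alone
```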
#### Fuzzy/Wildcard/Prefix queries got *much slower*
This is also expected: I currently use the default implementation provided
by `TermsEnum`, which does not take advantage of the FST. With an optimized
implementation I expect these queries to be at least on par, and possibly
slightly better, because the FST holds information about all terms, whereas
the current BlockTreeTerms index only holds prefixes.
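As a rough intuition for why an enumeration that cannot use the index structure ends up slower, here is a toy comparison for a prefix query over a sorted term list. It is only a sketch: it is not Lucene code, it does not model what the default `TermsEnum` methods actually do, and the function names are invented for the example.

```python
from bisect import bisect_left


def prefix_by_full_scan(sorted_terms, prefix):
    """Visit every term and filter. Returns (matches, terms visited)."""
    visited = 0
    matches = []
    for term in sorted_terms:
        visited += 1
        if term.startswith(prefix):
            matches.append(term)
    return matches, visited


def prefix_by_seek(sorted_terms, prefix):
    """Seek to the first term >= prefix, then stop at the first non-match.

    Returns (matches, terms visited).
    """
    start = bisect_left(sorted_terms, prefix)
    visited = 0
    matches = []
    for term in sorted_terms[start:]:
        visited += 1
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches, visited


terms = sorted(["apache", "apple", "apply", "banana",
                "band", "lucene", "lucid", "zebra"])

print(prefix_by_full_scan(terms, "luc"))
print(prefix_by_seek(terms, "luc"))
```

Both functions return the same matches; the difference is in how many terms they have to visit, and that gap grows with the size of the term dictionary.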
#### `HighTermTitleSort` and `HighTermMonthSort` got about 4.5% to 10% less throughput
I don't quite understand why term lookup could
#### `AndHighLow` got slower
Am I missing some optimization opportunity for low-frequency terms?