[
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178
]
Adrien Grand commented on LUCENE-10033:
---------------------------------------
I opened a PR with this idea. Queries that consume most values like the Browse*
faceting tasks become faster, but queries that only consume a small subset of
values, like some sorting tasks (not all: one of them is faster), become slower.
{noformat}
Task                      QPS baseline  StdDev  QPS patch  StdDev  Pct diff                 p-value
HighTermMonthSort               101.33  (9.7%)      51.93  (2.8%)    -48.7%  (-55% - -40%)    0.000
TermDTSort                      587.24  (6.1%)     404.20  (2.9%)    -31.2%  (-37% - -23%)    0.000
IntNRQ                           85.55 (14.7%)      73.16  (1.6%)    -14.5%  (-26% -   2%)    0.000
OrHighNotMed                   1301.37  (3.7%)    1218.64  (2.3%)     -6.4%  (-11% -   0%)    0.000
OrNotHighHigh                  1121.91  (4.1%)    1089.27  (2.7%)     -2.9%  ( -9% -   4%)    0.008
MedTerm                        2156.71  (3.3%)    2103.32  (3.6%)     -2.5%  ( -9% -   4%)    0.022
Fuzzy2                           67.41  (4.6%)      65.74  (4.9%)     -2.5%  (-11% -   7%)    0.098
OrNotHighLow                   1099.66  (3.7%)    1078.60  (3.0%)     -1.9%  ( -8% -   4%)    0.073
MedIntervalsOrdered              79.39  (3.0%)      77.94  (3.7%)     -1.8%  ( -8% -   5%)    0.088
MedPhrase                       403.62  (2.8%)     397.19  (2.3%)     -1.6%  ( -6% -   3%)    0.050
OrHighMed                       130.57  (3.0%)     128.64  (2.6%)     -1.5%  ( -6% -   4%)    0.099
LowIntervalsOrdered              20.82  (2.5%)      20.55  (3.4%)     -1.3%  ( -6% -   4%)    0.167
HighIntervalsOrdered              2.95  (5.1%)       2.91  (5.8%)     -1.1%  (-11% -  10%)    0.530
OrHighLow                       579.45  (2.9%)     574.45  (2.4%)     -0.9%  ( -5% -   4%)    0.306
LowSpanNear                      33.20  (2.9%)      33.06  (3.5%)     -0.4%  ( -6% -   6%)    0.668
HighSpanNear                      9.79  (3.5%)       9.79  (3.7%)     -0.0%  ( -7% -   7%)    0.996
Respell                         221.47  (2.1%)     221.62  (2.8%)      0.1%  ( -4% -   4%)    0.931
HighSloppyPhrase                 36.64  (3.4%)      36.69  (4.0%)      0.1%  ( -7% -   7%)    0.915
Wildcard                        283.85  (6.5%)     285.06  (7.2%)      0.4%  (-12% -  15%)    0.845
LowSloppyPhrase                 175.77  (4.3%)     176.56  (4.4%)      0.5%  ( -7% -   9%)    0.740
AndHighHigh                      64.34  (2.5%)      64.84  (3.4%)      0.8%  ( -5% -   6%)    0.410
HighTerm                       2146.56  (3.3%)    2164.26  (4.5%)      0.8%  ( -6% -   8%)    0.505
HighTermTitleBDVSort             27.18  (4.6%)      27.41  (2.1%)      0.8%  ( -5% -   7%)    0.461
OrHighNotLow                   1261.38  (2.3%)    1274.89  (3.0%)      1.1%  ( -4% -   6%)    0.210
MedSpanNear                      26.96  (4.1%)      27.28  (3.5%)      1.2%  ( -6% -   9%)    0.336
MedSloppyPhrase                 102.18  (4.7%)     103.51  (5.1%)      1.3%  ( -8% -  11%)    0.399
BrowseDateTaxoFacets              3.15  (4.0%)       3.19  (4.0%)      1.4%  ( -6% -   9%)    0.281
BrowseDayOfYearTaxoFacets         3.15  (4.0%)       3.20  (4.0%)      1.5%  ( -6% -   9%)    0.250
AndHighLow                     1295.59  (3.3%)    1318.11  (3.4%)      1.7%  ( -4% -   8%)    0.105
Prefix3                          63.21 (15.4%)      64.49 (17.1%)      2.0%  (-26% -  40%)    0.694
OrHighHigh                       35.41  (3.1%)      36.24  (3.1%)      2.4%  ( -3% -   8%)    0.015
Fuzzy1                          253.74  (6.1%)     260.89  (7.1%)      2.8%  ( -9% -  16%)    0.175
BrowseMonthTaxoFacets             3.42  (7.7%)       3.52  (4.1%)      2.9%  ( -8% -  15%)    0.135
AndHighMed                      164.48  (2.6%)     169.43  (3.3%)      3.0%  ( -2% -   9%)    0.001
LowTerm                        2645.26  (4.9%)    2752.43  (5.6%)      4.1%  ( -6% -  15%)    0.015
OrHighNotHigh                  1286.12  (3.7%)    1349.66  (4.6%)      4.9%  ( -3% -  13%)    0.000
HighPhrase                      105.61  (3.7%)     111.65  (4.8%)      5.7%  ( -2% -  14%)    0.000
LowPhrase                        35.85  (2.6%)      38.76  (3.3%)      8.1%  (  2% -  14%)    0.000
OrNotHighMed                   1241.35  (3.1%)    1368.49  (3.6%)     10.2%  (  3% -  17%)    0.000
HighTermDayOfYearSort           573.92  (9.5%)     687.19  (7.9%)     19.7%  (  2% -  40%)    0.000
BrowseMonthSSDVFacets            11.52  (5.1%)      17.81 (23.5%)     54.6%  ( 24% -  87%)    0.000
BrowseDayOfYearSSDVFacets        11.24  (3.9%)      18.15 (23.1%)     61.4%  ( 33% -  91%)    0.000
{noformat}
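One plausible reading of the split between faceting and sorting tasks is that decoding a whole block at once pays off when most of its values are consumed, and is wasted work when only a few are. The sketch below is a toy cost model for that intuition, not Lucene code: the class and method names are hypothetical, and the 128-value block size is only the example figure from the issue description.

```java
import java.util.stream.IntStream;

// Toy cost model, not Lucene code. All names here are hypothetical.
final class BlockDecodeCost {

  /**
   * Returns how many values get decoded when every block that contains at
   * least one visited doc ID is decoded in full. Doc IDs must be sorted.
   */
  static long valuesDecoded(int[] sortedDocIds, int blockSize) {
    long blocksTouched = 0;
    int lastBlock = -1;
    for (int docId : sortedDocIds) {
      int block = docId / blockSize;
      if (block != lastBlock) { // first visit to this block
        blocksTouched++;
        lastBlock = block;
      }
    }
    return blocksTouched * blockSize;
  }

  public static void main(String[] args) {
    int blockSize = 128; // assumed block size, as suggested in the issue

    // Dense access (faceting-like): every doc ID in [0, 1280) is visited.
    int[] dense = IntStream.range(0, 1280).toArray();
    // Sparse access (sorting-like): only every 1000th doc ID is visited.
    int[] sparse = IntStream.range(0, 10).map(i -> i * 1000).toArray();

    System.out.println("dense:  consumed=" + dense.length
        + " decoded=" + valuesDecoded(dense, blockSize));
    System.out.println("sparse: consumed=" + sparse.length
        + " decoded=" + valuesDecoded(sparse, blockSize));
  }
}
```

In this model both access patterns decode 1280 values, but the dense one consumes all of them while the sparse one consumes only 10. It ignores skip data, caching and SIMD effects, so it says nothing about the actual magnitude of the differences in the table.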
> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
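The outlier problem mentioned in the issue description can be illustrated with a rough calculation. The sketch below assumes a simplified scheme where every value in a block is stored with the number of bits needed by the block's largest value, and it ignores per-block headers and skip data. The class and helper names are hypothetical and are not the DirectWriter/DirectReader API.

```java
import java.util.Arrays;

// Simplified bit-packing cost estimate, not Lucene code. Names are hypothetical.
final class BitsPerValueSketch {

  /** Bits needed to represent a non-negative value, with a minimum of 1. */
  static int bitsRequired(long maxValue) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /**
   * Total payload bits when values are split into blocks of blockSize and each
   * block uses the bit width of its own maximum. Values must be non-negative.
   */
  static long packedBits(long[] values, int blockSize) {
    long total = 0;
    for (int start = 0; start < values.length; start += blockSize) {
      int end = Math.min(start + blockSize, values.length);
      long max = 0;
      for (int i = start; i < end; i++) {
        max = Math.max(max, values[i]);
      }
      total += (long) bitsRequired(max) * (end - start);
    }
    return total;
  }

  public static void main(String[] args) {
    long[] values = new long[16384];
    Arrays.fill(values, 1L);   // every value fits in 1 bit...
    values[5000] = 1L << 20;   // ...except one outlier that needs 21 bits

    System.out.println("one 16k block:    " + packedBits(values, 16384) + " bits");
    System.out.println("128-value blocks: " + packedBits(values, 128) + " bits");
  }
}
```

With a single 16k block the outlier forces 21 bits on all 16384 values (344064 bits), while with 128-value blocks only the block containing the outlier pays for it (18944 bits in total). A real format would add per-block metadata, which narrows the gap somewhat.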