[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178 ]
Adrien Grand commented on LUCENE-10033:
---------------------------------------

I opened a PR with this idea. Queries that consume most values, like the Browse* faceting tasks, become faster, but queries that only consume a small subset of values, like some sorting tasks, become slower (not all, one of them is faster).

{noformat}
Task                      QPS baseline   StdDev  QPS patch   StdDev  Pct diff                 p-value
HighTermMonthSort               101.33   (9.7%)      51.93   (2.8%)    -48.7%  (-55% - -40%)    0.000
TermDTSort                      587.24   (6.1%)     404.20   (2.9%)    -31.2%  (-37% - -23%)    0.000
IntNRQ                           85.55  (14.7%)      73.16   (1.6%)    -14.5%  (-26% -   2%)    0.000
OrHighNotMed                   1301.37   (3.7%)    1218.64   (2.3%)     -6.4%  (-11% -   0%)    0.000
OrNotHighHigh                  1121.91   (4.1%)    1089.27   (2.7%)     -2.9%  ( -9% -   4%)    0.008
MedTerm                        2156.71   (3.3%)    2103.32   (3.6%)     -2.5%  ( -9% -   4%)    0.022
Fuzzy2                           67.41   (4.6%)      65.74   (4.9%)     -2.5%  (-11% -   7%)    0.098
OrNotHighLow                   1099.66   (3.7%)    1078.60   (3.0%)     -1.9%  ( -8% -   4%)    0.073
MedIntervalsOrdered              79.39   (3.0%)      77.94   (3.7%)     -1.8%  ( -8% -   5%)    0.088
MedPhrase                       403.62   (2.8%)     397.19   (2.3%)     -1.6%  ( -6% -   3%)    0.050
OrHighMed                       130.57   (3.0%)     128.64   (2.6%)     -1.5%  ( -6% -   4%)    0.099
LowIntervalsOrdered              20.82   (2.5%)      20.55   (3.4%)     -1.3%  ( -6% -   4%)    0.167
HighIntervalsOrdered              2.95   (5.1%)       2.91   (5.8%)     -1.1%  (-11% -  10%)    0.530
OrHighLow                       579.45   (2.9%)     574.45   (2.4%)     -0.9%  ( -5% -   4%)    0.306
LowSpanNear                      33.20   (2.9%)      33.06   (3.5%)     -0.4%  ( -6% -   6%)    0.668
HighSpanNear                      9.79   (3.5%)       9.79   (3.7%)     -0.0%  ( -7% -   7%)    0.996
Respell                         221.47   (2.1%)     221.62   (2.8%)      0.1%  ( -4% -   4%)    0.931
HighSloppyPhrase                 36.64   (3.4%)      36.69   (4.0%)      0.1%  ( -7% -   7%)    0.915
Wildcard                        283.85   (6.5%)     285.06   (7.2%)      0.4%  (-12% -  15%)    0.845
LowSloppyPhrase                 175.77   (4.3%)     176.56   (4.4%)      0.5%  ( -7% -   9%)    0.740
AndHighHigh                      64.34   (2.5%)      64.84   (3.4%)      0.8%  ( -5% -   6%)    0.410
HighTerm                       2146.56   (3.3%)    2164.26   (4.5%)      0.8%  ( -6% -   8%)    0.505
HighTermTitleBDVSort             27.18   (4.6%)      27.41   (2.1%)      0.8%  ( -5% -   7%)    0.461
OrHighNotLow                   1261.38   (2.3%)    1274.89   (3.0%)      1.1%  ( -4% -   6%)    0.210
MedSpanNear                      26.96   (4.1%)      27.28   (3.5%)      1.2%  ( -6% -   9%)    0.336
MedSloppyPhrase                 102.18   (4.7%)     103.51   (5.1%)      1.3%  ( -8% -  11%)    0.399
BrowseDateTaxoFacets              3.15   (4.0%)       3.19   (4.0%)      1.4%  ( -6% -   9%)    0.281
BrowseDayOfYearTaxoFacets         3.15   (4.0%)       3.20   (4.0%)      1.5%  ( -6% -   9%)    0.250
AndHighLow                     1295.59   (3.3%)    1318.11   (3.4%)      1.7%  ( -4% -   8%)    0.105
Prefix3                          63.21  (15.4%)      64.49  (17.1%)      2.0%  (-26% -  40%)    0.694
OrHighHigh                       35.41   (3.1%)      36.24   (3.1%)      2.4%  ( -3% -   8%)    0.015
Fuzzy1                          253.74   (6.1%)     260.89   (7.1%)      2.8%  ( -9% -  16%)    0.175
BrowseMonthTaxoFacets             3.42   (7.7%)       3.52   (4.1%)      2.9%  ( -8% -  15%)    0.135
AndHighMed                      164.48   (2.6%)     169.43   (3.3%)      3.0%  ( -2% -   9%)    0.001
LowTerm                        2645.26   (4.9%)    2752.43   (5.6%)      4.1%  ( -6% -  15%)    0.015
OrHighNotHigh                  1286.12   (3.7%)    1349.66   (4.6%)      4.9%  ( -3% -  13%)    0.000
HighPhrase                      105.61   (3.7%)     111.65   (4.8%)      5.7%  ( -2% -  14%)    0.000
LowPhrase                        35.85   (2.6%)      38.76   (3.3%)      8.1%  (  2% -  14%)    0.000
OrNotHighMed                   1241.35   (3.1%)    1368.49   (3.6%)     10.2%  (  3% -  17%)    0.000
HighTermDayOfYearSort           573.92   (9.5%)     687.19   (7.9%)     19.7%  (  2% -  40%)    0.000
BrowseMonthSSDVFacets            11.52   (5.1%)      17.81  (23.5%)     54.6%  ( 24% -  87%)    0.000
BrowseDayOfYearSSDVFacets        11.24   (3.9%)      18.15  (23.1%)     61.4%  ( 33% -  91%)    0.000
{noformat}

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g.
> a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
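
To make the outlier point in the issue description concrete, here is a minimal Java sketch that compares the payload size of one 16k-value block against 128-value blocks when a single value is far out of range. It is an illustration only and not Lucene code: the class `BlockBitsSketch` and its methods are hypothetical, the encoding is assumed to be a plain "offset from the block minimum, fixed bit width per block" scheme, and per-block metadata (minimum, bit width, skip data or jump tables) is ignored.

```java
public class BlockBitsSketch {

    /**
     * Bits needed to store every value in [from, to) as an unsigned offset
     * from the block minimum. Assumes max - min does not overflow a long.
     */
    public static int bitsPerValue(long[] values, int from, int to) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = from; i < to; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        long range = max - min;
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    /**
     * Total payload bits when the values are cut into fixed-size blocks and
     * each block picks its own bit width. Block metadata is not counted.
     */
    public static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int from = 0; from < values.length; from += blockSize) {
            int to = Math.min(values.length, from + blockSize);
            total += (long) bitsPerValue(values, from, to) * (to - from);
        }
        return total;
    }

    /** Hypothetical data set: 16384 values that fit in 4 bits, plus one outlier. */
    public static long[] sampleWithOutlier() {
        long[] values = new long[16384];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16; // fits in 4 bits
        }
        values[5000] = 1L << 40; // the single outlier needs 41 bits
        return values;
    }

    public static void main(String[] args) {
        long[] values = sampleWithOutlier();
        // One 16k block: the outlier forces 41 bits on all 16384 values.
        System.out.println("16384-value block: " + totalBits(values, 16384) + " bits"); // 671744
        // 128-value blocks: only the block holding the outlier pays 41 bits per value.
        System.out.println("128-value blocks:  " + totalBits(values, 128) + " bits");   // 70272
    }
}
```

Under these assumptions the outlier costs the whole 16k block 41 bits per value, while with 128-value blocks only the one block that contains it is affected. A real format would also pay for per-block metadata, so the actual gap would be smaller than this sketch suggests.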