[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312792#comment-17312792 ]
Greg Miller commented on LUCENE-9850:
-------------------------------------

I gave this benchmark another run now that PFOR has been updated from 3 allowable exceptions to 7. As expected, the index size reduction is further improved, but the QPS regressions appear to get worse. Here's what I see:

Note: Still using "-source wikimediumall" (wikimedium.10M.nostopwords.tasks).

The doc ID payload portion of the index is reduced 11.9% (~3.3GB -> ~2.9GB).
The overall index is reduced 3.3% (~11.6GB -> ~11.2GB).

{code:java}
BASELINE
DOC ID BPV
 0 ****   [ 6.86 pct] (1529467 of 22287406)
 1 *      [ 0.00 pct] (91 of 22287406)
 2 *      [ 0.60 pct] (133848 of 22287406)
 3 **     [ 2.09 pct] (466022 of 22287406)
 4 **     [ 3.06 pct] (683006 of 22287406)
 5 ***    [ 4.44 pct] (990644 of 22287406)
 6 ***    [ 5.86 pct] (1305537 of 22287406)
 7 *****  [ 8.38 pct] (1867660 of 22287406)
 8 *****  [ 9.92 pct] (2211136 of 22287406)
 9 ****** [10.79 pct] (2405504 of 22287406)
10 *****  [ 9.77 pct] (2178356 of 22287406)
11 *****  [ 8.61 pct] (1919968 of 22287406)
12 ****   [ 7.63 pct] (1701251 of 22287406)
13 ****   [ 6.40 pct] (1426872 of 22287406)
14 ***    [ 4.94 pct] (1101624 of 22287406)
15 **     [ 3.62 pct] (806380 of 22287406)
16 **     [ 2.62 pct] (583235 of 22287406)
17 *      [ 1.83 pct] (407402 of 22287406)
18 *      [ 1.28 pct] (285690 of 22287406)
19 *      [ 0.78 pct] (172866 of 22287406)
20 *      [ 0.27 pct] (59108 of 22287406)
21 *      [ 0.12 pct] (26582 of 22287406)
22 *      [ 0.08 pct] (17481 of 22287406)
23 *      [ 0.03 pct] (7676 of 22287406)
24        [ 0.00 pct] (0 of 22287406)
25        [ 0.00 pct] (0 of 22287406)
26        [ 0.00 pct] (0 of 22287406)
27        [ 0.00 pct] (0 of 22287406)
28        [ 0.00 pct] (0 of 22287406)
29        [ 0.00 pct] (0 of 22287406)
30        [ 0.00 pct] (0 of 22287406)
31        [ 0.00 pct] (0 of 22287406)
Total bytes used: 3295496560

NEW CANDIDATE (PFOR doc IDs with 7 exceptions)
DOC ID BPV
 0 ****   [ 7.07 pct] (1576532 of 22287406)
 1 *      [ 1.44 pct] (321744 of 22287406)
 2 **     [ 3.74 pct] (834608 of 22287406)
 3 ***    [ 4.58 pct] (1019776 of 22287406)
 4 ***    [ 5.70 pct] (1271157 of 22287406)
 5 ****   [ 6.56 pct] (1463046 of 22287406)
 6 *****  [ 9.28 pct] (2068438 of 22287406)
 7 *****  [ 9.71 pct] (2163462 of 22287406)
 8 *****  [ 9.41 pct] (2097645 of 22287406)
 9 *****  [ 8.58 pct] (1911927 of 22287406)
10 *****  [ 8.08 pct] (1801505 of 22287406)
11 ****   [ 6.92 pct] (1542164 of 22287406)
12 ***    [ 5.52 pct] (1231201 of 22287406)
13 ***    [ 4.30 pct] (957713 of 22287406)
14 **     [ 3.37 pct] (750159 of 22287406)
15 **     [ 2.38 pct] (531051 of 22287406)
16 *      [ 1.65 pct] (367735 of 22287406)
17 *      [ 1.15 pct] (255594 of 22287406)
18 *      [ 0.52 pct] (116752 of 22287406)
19 *      [ 0.02 pct] (5197 of 22287406)
20        [ 0.00 pct] (0 of 22287406)
21        [ 0.00 pct] (0 of 22287406)
22        [ 0.00 pct] (0 of 22287406)
23        [ 0.00 pct] (0 of 22287406)
24        [ 0.00 pct] (0 of 22287406)
25        [ 0.00 pct] (0 of 22287406)
26        [ 0.00 pct] (0 of 22287406)
27        [ 0.00 pct] (0 of 22287406)
28        [ 0.00 pct] (0 of 22287406)
29        [ 0.00 pct] (0 of 22287406)
30        [ 0.00 pct] (0 of 22287406)
31        [ 0.00 pct] (0 of 22287406)
Total bytes used: 2904198119
{code}

QPS regressions as follows:

{code:java}
Task                 QPS baseline  StdDev   QPS pfordocids  StdDev   Pct diff (range)        p-value
Prefix3                    163.80 (13.3%)   145.05  (8.8%)  -11.4% ( -29% -   12%) 0.001
AndHighMed                  55.87  (4.5%)    51.35  (2.6%)   -8.1% ( -14% -    0%) 0.000
LowSpanNear                  8.15  (1.8%)     7.69  (1.8%)   -5.6% (  -8% -   -2%) 0.000
OrNotHighMed               511.04  (7.0%)   484.78  (5.0%)   -5.1% ( -16% -    7%) 0.008
AndHighLow                 295.02  (3.5%)   279.93  (3.1%)   -5.1% ( -11% -    1%) 0.000
OrNotHighLow               516.68  (6.4%)   491.41  (4.7%)   -4.9% ( -15% -    6%) 0.006
HighSpanNear                12.33  (2.0%)    11.74  (1.6%)   -4.7% (  -8% -   -1%) 0.000
OrNotHighHigh              398.33  (6.7%)   381.31  (6.8%)   -4.3% ( -16% -    9%) 0.046
MedSpanNear                  7.42  (2.0%)     7.14  (2.2%)   -3.8% (  -7% -    0%) 0.000
Wildcard                   148.87 (11.6%)   143.67 (10.4%)   -3.5% ( -22% -   20%) 0.315
HighTermMonthSort           35.65 (15.2%)    34.48 (12.2%)   -3.3% ( -26% -   28%) 0.454
AndHighHigh                 17.88  (2.7%)    17.32  (2.9%)   -3.1% (  -8% -    2%) 0.000
MedPhrase                   11.16  (4.2%)    10.83  (3.2%)   -3.0% (  -9% -    4%) 0.013
TermDTSort                  41.80 (14.5%)    40.67 (12.0%)   -2.7% ( -25% -   27%) 0.522
LowPhrase                   38.27  (5.0%)    37.26  (4.5%)   -2.6% ( -11% -    7%) 0.082
OrHighNotHigh              553.81  (8.1%)   541.01  (7.5%)   -2.3% ( -16% -   14%) 0.347
MedSloppyPhrase              7.30  (2.1%)     7.14  (3.1%)   -2.2% (  -7% -    3%) 0.008
OrHighMed                   40.15  (3.4%)    39.27  (2.8%)   -2.2% (  -8% -    4%) 0.027
OrHighHigh                   7.29  (2.6%)     7.13  (2.8%)   -2.2% (  -7% -    3%) 0.011
OrHighLow                  166.87  (5.1%)   163.23  (4.3%)   -2.2% ( -11% -    7%) 0.145
HighTermTitleBDVSort        18.65 (10.8%)    18.25 (12.6%)   -2.1% ( -22% -   23%) 0.569
HighTermDayOfYearSort       30.11 (12.3%)    29.57 (11.8%)   -1.8% ( -22% -   25%) 0.641
HighSloppyPhrase             5.12  (2.4%)     5.03  (3.6%)   -1.7% (  -7% -    4%) 0.079
HighPhrase                 113.33  (6.4%)   111.61  (6.0%)   -1.5% ( -13% -   11%) 0.437
Fuzzy2                      36.81  (7.1%)    36.33  (7.9%)   -1.3% ( -15% -   14%) 0.584
HighIntervalsOrdered         8.65  (1.5%)     8.54  (1.7%)   -1.2% (  -4% -    2%) 0.016
LowSloppyPhrase             68.84  (1.7%)    68.10  (2.4%)   -1.1% (  -5% -    3%) 0.101
OrHighNotLow               517.95  (8.7%)   514.80  (7.0%)   -0.6% ( -15% -   16%) 0.807
MedTerm                    907.88  (6.4%)   902.40  (7.2%)   -0.6% ( -13% -   13%) 0.779
Respell                     31.17  (2.8%)    31.10  (2.7%)   -0.2% (  -5% -    5%) 0.785
BrowseMonthSSDVFacets        3.15  (1.6%)     3.14  (1.1%)   -0.2% (  -2% -    2%) 0.662
BrowseMonthTaxoFacets        1.40  (1.7%)     1.40  (1.2%)    0.2% (  -2% -    3%) 0.694
HighTerm                   713.93  (4.2%)   715.28  (4.6%)    0.2% (  -8% -    9%) 0.893
OrHighNotMed               445.75  (7.5%)   446.79  (9.4%)    0.2% ( -15% -   18%) 0.931
BrowseDayOfYearTaxoFacets    1.21  (2.7%)     1.21  (2.6%)    0.3% (  -4% -    5%) 0.761
BrowseDateTaxoFacets         1.21  (2.5%)     1.22  (2.4%)    0.3% (  -4% -    5%) 0.710
BrowseDayOfYearSSDVFacets    2.89  (1.5%)     2.90  (1.1%)    0.4% (  -2% -    2%) 0.352
PKLookup                   128.08  (5.8%)   128.60  (5.7%)    0.4% ( -10% -   12%) 0.822
IntNRQ                      15.95 (18.3%)    16.05 (18.9%)    0.6% ( -30% -   46%) 0.914
LowTerm                    906.79  (5.0%)   925.27  (4.9%)    2.0% (  -7% -   12%) 0.193
Fuzzy1                      38.59  (6.4%)    39.40  (6.5%)    2.1% ( -10% -   16%) 0.301
{code}

I'd love to find a way to cut down on this QPS regression since there's a decent index size reduction to be had here. I'll have to see if I can figure out any way to further optimize this.

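
To make the effect of raising the exception limit from 3 to 7 concrete, here is a minimal sketch of how a patched encoding could pick the bits per value (BPV) for one block. It is an illustration only, not the code in Lucene's PFOR implementation: the class and method names are hypothetical, and it assumes each exception keeps its overflow bits in a single extra byte, so the chosen width can be at most 8 bits below the width of the largest value.

```java
import java.util.Arrays;

/** Hypothetical illustration of FOR vs. PFOR width selection; not Lucene's code. */
final class PforBitsSketch {

  /** Number of bits needed to represent a non-negative value (0 for the value 0). */
  static int bitsRequired(long v) {
    return 64 - Long.numberOfLeadingZeros(v);
  }

  /** FOR: every value in the block is stored at the width of the largest value. */
  static int forBitsPerValue(long[] block) {
    long max = 0;
    for (long v : block) {
      max = Math.max(max, v);
    }
    return bitsRequired(max);
  }

  /**
   * PFOR: pick a width such that at most maxExceptions values do not fit.
   * Assumption of this sketch: an exception stores its overflow in one byte,
   * so the width cannot be more than 8 bits below the widest value.
   */
  static int pforBitsPerValue(long[] block, int maxExceptions) {
    long[] sorted = block.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    int maxBits = bitsRequired(sorted[n - 1]);
    // Largest value that must fit without being treated as an exception.
    long pivot = n > maxExceptions ? sorted[n - 1 - maxExceptions] : 0L;
    return Math.max(bitsRequired(pivot), maxBits - 8);
  }

  /** Counts the values that overflow the given width and would need patching. */
  static int countExceptions(long[] block, int bitsPerValue) {
    int count = 0;
    for (long v : block) {
      if (bitsRequired(v) > bitsPerValue) {
        count++;
      }
    }
    return count;
  }
}
```

For a block of 128 deltas where 121 values are 5 and 7 values are 1000, FOR needs 10 bits per value. Under the assumptions above, a limit of 3 exceptions also stays at 10 bits because 7 outliers cannot all be patched, while a limit of 7 drops the block to 3 bits with 7 exceptions. This is the kind of shift toward lower BPV buckets that the histograms show.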
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
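
The issue description does not spell out which optimization is lost, so the following is only a sketch of the general shape of the trade-off, with hypothetical names throughout. Doc IDs are recovered from deltas by a running sum. With a patched encoding, the overflow bits of each exception have to be restored before that sum can run, which is one plausible source of extra decode work.

```java
/** Hypothetical illustration of delta expansion; not Lucene's decoding code. */
final class DeltaDecodeSketch {

  /** Turns doc ID deltas into absolute doc IDs in place via a running sum. */
  static void prefixSum(long[] deltas, long base) {
    long acc = base;
    for (int i = 0; i < deltas.length; i++) {
      acc += deltas[i];
      deltas[i] = acc;
    }
  }

  /**
   * Patched variant: lowBits holds each delta truncated to bitsPerValue bits.
   * The overflow bits of every exception are OR-ed back in first, and only
   * then can the running sum be applied.
   */
  static void patchThenPrefixSum(
      long[] lowBits, int bitsPerValue, int[] exceptionIndex, long[] exceptionHighBits, long base) {
    for (int i = 0; i < exceptionIndex.length; i++) {
      lowBits[exceptionIndex[i]] |= exceptionHighBits[i] << bitsPerValue;
    }
    prefixSum(lowBits, base);
  }
}
```

For example, the deltas 3, 1, 1000, 2 after a base doc ID of 100 expand to 103, 104, 1104, 1106. Stored at 3 bits per value, the delta 1000 becomes an exception with low bits 0 and overflow bits 125, and the patched path reproduces the same doc IDs.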