[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308216#comment-17308216 ]
Greg Miller commented on LUCENE-9850:
-------------------------------------

I ran a luceneutil benchmark comparing my PFOR approach to encoding doc ID deltas (available [here|https://github.com/gsmiller/lucene/tree/LUCENE-9850/pfordocids]) to the main branch. Here are the results. This is the first luceneutil benchmark I've run, so I'm still getting familiar with the tool and interpreting results. This was run with the "wikimediumall" source.

If I'm interpreting these results correctly, it looks like there is a pretty material performance penalty to using PFOR instead of FOR, but I'd be curious what other, more experienced folks see in these results. I'll see if I can get some figures on the index size difference as well, but I'm not sure there's a good path forward here with these QPS results.

{code:java}
Task                      QPS baseline   StdDev QPS pfor doc ids   StdDev Pct diff                  p-value
TermDTSort                       38.02  (11.8%)            36.08   (8.9%)    -5.1% ( -23% -   17%)   0.123
OrNotHighLow                    488.43   (5.8%)           466.01   (6.2%)    -4.6% ( -15% -    7%)   0.016
HighTerm                       1276.94   (5.0%)          1222.31   (5.3%)    -4.3% ( -13% -    6%)   0.009
HighTermDayOfYearSort            51.64  (11.6%)            49.66   (8.0%)    -3.8% ( -20% -   17%)   0.223
HighTermMonthSort                59.36  (10.6%)            57.09  (11.4%)    -3.8% ( -23% -   20%)   0.272
HighTermTitleBDVSort             36.61  (16.4%)            35.27  (19.2%)    -3.7% ( -33% -   38%)   0.517
AndHighHigh                      11.06   (3.7%)            10.67   (3.0%)    -3.5% (  -9% -    3%)   0.001
OrHighNotHigh                   568.03  (10.4%)           548.46   (7.7%)    -3.4% ( -19% -   16%)   0.233
OrHighLow                       261.36   (3.9%)           252.58   (3.7%)    -3.4% ( -10% -    4%)   0.005
AndHighMed                       82.45   (3.1%)            79.71   (3.1%)    -3.3% (  -9% -    2%)   0.001
MedPhrase                        40.33   (5.4%)            39.02   (4.7%)    -3.2% ( -12% -    7%)   0.043
Wildcard                         25.19   (2.8%)            24.46   (2.7%)    -2.9% (  -8% -    2%)   0.001
LowSpanNear                       5.52   (2.0%)             5.36   (2.3%)    -2.9% (  -6% -    1%)   0.000
AndHighLow                      203.23   (2.9%)           197.52   (2.6%)    -2.8% (  -8% -    2%)   0.001
OrHighMed                        19.99   (2.0%)            19.43   (2.1%)    -2.8% (  -6% -    1%)   0.000
MedTerm                         829.73   (6.4%)           807.65   (5.1%)    -2.7% ( -13% -    9%)   0.144
OrHighNotLow                    482.63   (4.8%)           469.91   (5.5%)    -2.6% ( -12% -    8%)   0.105
OrHighHigh                        9.20   (2.0%)             8.97   (2.3%)    -2.5% (  -6% -    1%)   0.000
LowPhrase                        16.16   (3.3%)            15.76   (2.7%)    -2.5% (  -8% -    3%)   0.009
MedSpanNear                       3.14   (2.1%)             3.07   (2.3%)    -2.3% (  -6% -    2%)   0.001
Prefix3                         121.86   (8.5%)           119.12   (6.5%)    -2.2% ( -15% -   13%)   0.349
OrNotHighMed                    477.93   (6.0%)           467.27   (6.7%)    -2.2% ( -14% -   11%)   0.268
HighSpanNear                      9.24   (2.2%)             9.05   (2.1%)    -2.0% (  -6% -    2%)   0.004
MedSloppyPhrase                  16.95   (2.9%)            16.67   (3.0%)    -1.7% (  -7% -    4%)   0.069
IntNRQ                           49.47   (2.6%)            48.88   (1.6%)    -1.2% (  -5% -    3%)   0.087
LowSloppyPhrase                  30.67   (2.7%)            30.33   (2.8%)    -1.1% (  -6% -    4%)   0.198
LowTerm                         984.89   (4.7%)           973.96   (3.1%)    -1.1% (  -8% -    7%)   0.380
OrNotHighHigh                   476.25   (8.3%)           471.56   (7.9%)    -1.0% ( -15% -   16%)   0.701
HighIntervalsOrdered              4.20   (2.2%)             4.18   (2.4%)    -0.7% (  -5% -    3%)   0.347
OrHighNotMed                    445.69   (5.1%)           443.29   (5.6%)    -0.5% ( -10% -   10%)   0.750
BrowseMonthTaxoFacets             1.41   (1.2%)             1.41   (1.4%)    -0.3% (  -2% -    2%)   0.427
PKLookup                        127.78   (3.2%)           127.46   (2.9%)    -0.3% (  -6% -    6%)   0.794
BrowseDayOfYearTaxoFacets         1.22   (2.1%)             1.22   (2.1%)    -0.2% (  -4% -    4%)   0.735
BrowseDateTaxoFacets              1.23   (2.0%)             1.23   (2.0%)    -0.2% (  -4% -    3%)   0.782
BrowseDayOfYearSSDVFacets         2.90   (0.9%)             2.90   (1.0%)    -0.1% (  -1% -    1%)   0.685
BrowseMonthSSDVFacets             3.15   (1.0%)             3.15   (1.1%)     0.0% (  -2% -    2%)   0.903
HighSloppyPhrase                  7.89   (5.5%)             7.91   (4.4%)     0.2% (  -9% -   10%)   0.876
Fuzzy2                           34.20   (9.0%)            34.31   (8.4%)     0.3% ( -15% -   19%)   0.909
Fuzzy1                           44.78   (6.2%)            44.95   (6.0%)     0.4% ( -11% -   13%)   0.851
Respell                          21.07   (2.5%)            21.16   (2.4%)     0.5% (  -4% -    5%)   0.552
HighPhrase                      274.21   (5.3%)           279.29   (5.3%)     1.9% (  -8% -   13%)   0.269
{code}

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
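
For readers less familiar with the two encodings, the size side of the trade-off described in the issue can be sketched roughly as follows. This is a minimal illustration, not Lucene's actual codec code: the class name `ForVsPforSketch`, its methods, the block size of 128 and the exception budget of 7 are all assumptions chosen for the example. FOR packs every delta in a block at the width of the largest delta, while a PFOR-style scheme picks a narrower width and stores the few outliers separately as exceptions.

```java
import java.util.Arrays;

// Hypothetical sketch: compares the per-value bit width that FOR and a
// PFOR-style encoder would choose for one block of doc ID deltas.
public class ForVsPforSketch {

    // Number of bits needed to represent a non-negative value (0 needs 0 bits).
    static int bitsRequired(int v) {
        return 32 - Integer.numberOfLeadingZeros(v);
    }

    // FOR: every value in the block is packed at the width of the largest one.
    static int forBitsPerValue(int[] deltas) {
        int max = 0;
        for (int d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    // PFOR-style: ignore up to maxExceptions of the largest values when
    // choosing the width; those outliers would be stored separately.
    static int pforBitsPerValue(int[] deltas, int maxExceptions) {
        int[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int idx = Math.max(0, sorted.length - 1 - maxExceptions);
        return bitsRequired(sorted[idx]);
    }

    // Expanding deltas back into doc IDs is a running (prefix) sum.
    static int[] prefixSum(int base, int[] deltas) {
        int[] docIds = new int[deltas.length];
        int current = base;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            docIds[i] = current;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] deltas = new int[128];      // assumed block size
        Arrays.fill(deltas, 3);           // mostly small gaps
        deltas[10] = 70_000;              // two outliers
        deltas[90] = 1_000_000;
        System.out.println("FOR bits/value:  " + forBitsPerValue(deltas));
        System.out.println("PFOR bits/value: " + pforBitsPerValue(deltas, 7));
    }
}
```

With two outliers in an otherwise dense block, the FOR width is dictated by the largest delta, whereas the PFOR-style width is dictated by the typical delta. The cost is that decoding must patch the exceptions back in before the prefix sum can run, which is the kind of extra work the benchmark above is trying to measure.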