[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308723#comment-17308723 ]
Greg Miller commented on LUCENE-9850:
-------------------------------------

Also, here's the direct impact on bits-per-value (BPV) on the EN Wikipedia index generated from the benchmark "-source wikimediumall" task. It looks like a ~10% reduction in the doc ID block. This is a little naive, though, since it doesn't take into account the extra storage PFOR needs for the exceptions, but it helps illustrate the difference that's creating the 2% index size reduction.

{code:java}
BASELINE (FOR)
DOC ID BPV
 0 ****   [6.86 pct]  (1529467 of 22287406)
 1 *      [0.00 pct]  (91 of 22287406)
 2 *      [0.60 pct]  (133848 of 22287406)
 3 **     [2.09 pct]  (466022 of 22287406)
 4 **     [3.06 pct]  (683006 of 22287406)
 5 ***    [4.44 pct]  (990644 of 22287406)
 6 ***    [5.86 pct]  (1305537 of 22287406)
 7 *****  [8.38 pct]  (1867660 of 22287406)
 8 *****  [9.92 pct]  (2211136 of 22287406)
 9 ****** [10.79 pct] (2405504 of 22287406)
10 *****  [9.77 pct]  (2178356 of 22287406)
11 *****  [8.61 pct]  (1919968 of 22287406)
12 ****   [7.63 pct]  (1701251 of 22287406)
13 ****   [6.40 pct]  (1426872 of 22287406)
14 ***    [4.94 pct]  (1101624 of 22287406)
15 **     [3.62 pct]  (806380 of 22287406)
16 **     [2.62 pct]  (583235 of 22287406)
17 *      [1.83 pct]  (407402 of 22287406)
18 *      [1.28 pct]  (285690 of 22287406)
19 *      [0.78 pct]  (172866 of 22287406)
20 *      [0.27 pct]  (59108 of 22287406)
21 *      [0.12 pct]  (26582 of 22287406)
22 *      [0.08 pct]  (17481 of 22287406)
23 *      [0.03 pct]  (7676 of 22287406)
24        [0.00 pct]  (0 of 22287406)
25        [0.00 pct]  (0 of 22287406)
26        [0.00 pct]  (0 of 22287406)
27        [0.00 pct]  (0 of 22287406)
28        [0.00 pct]  (0 of 22287406)
29        [0.00 pct]  (0 of 22287406)
30        [0.00 pct]  (0 of 22287406)
31        [0.00 pct]  (0 of 22287406)
Total bytes used: 25746066

CANDIDATE (PFOR)
DOC ID BPV
 0 ****   [7.06 pct]  (1573609 of 22287406)
 1 *      [0.62 pct]  (139054 of 22287406)
 2 **     [2.12 pct]  (471777 of 22287406)
 3 **     [3.70 pct]  (824652 of 22287406)
 4 ***    [4.95 pct]  (1102450 of 22287406)
 5 ***    [5.68 pct]  (1266069 of 22287406)
 6 ****   [7.95 pct]  (1772639 of 22287406)
 7 *****  [9.86 pct]  (2197883 of 22287406)
 8 *****  [9.92 pct]  (2211276 of 22287406)
 9 *****  [9.25 pct]  (2061395 of 22287406)
10 *****  [8.53 pct]  (1902012 of 22287406)
11 ****   [7.68 pct]  (1710722 of 22287406)
12 ****   [6.41 pct]  (1427739 of 22287406)
13 ***    [5.01 pct]  (1117073 of 22287406)
14 **     [3.89 pct]  (866890 of 22287406)
15 **     [2.81 pct]  (627122 of 22287406)
16 *      [2.00 pct]  (444684 of 22287406)
17 *      [1.38 pct]  (308501 of 22287406)
18 *      [0.91 pct]  (203542 of 22287406)
19 *      [0.24 pct]  (52612 of 22287406)
20 *      [0.03 pct]  (5689 of 22287406)
21 *      [0.00 pct]  (16 of 22287406)
22        [0.00 pct]  (0 of 22287406)
23        [0.00 pct]  (0 of 22287406)
24        [0.00 pct]  (0 of 22287406)
25        [0.00 pct]  (0 of 22287406)
26        [0.00 pct]  (0 of 22287406)
27        [0.00 pct]  (0 of 22287406)
28        [0.00 pct]  (0 of 22287406)
29        [0.00 pct]  (0 of 22287406)
30        [0.00 pct]  (0 of 22287406)
31        [0.00 pct]  (0 of 22287406)
Total bytes used: 23091702
{code}

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
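For anyone who wants to cross-check the "Total bytes used" figures in the two histograms in the comment above, here is a minimal sketch. It assumes that each total is simply the sum of (bits per value x count) over all rows, divided by 8 and truncated; that formula is inferred from the numbers, not taken from the tool that produced them. The class and method names (BpvHistogramCheck, totalBytes, and so on) are made up for illustration and are not part of Lucene or luceneutil.

```java
public class BpvHistogramCheck {
    // Counts copied from the FOR histogram; array index = bits per value.
    // Rows 24..31 are all zero and are omitted.
    public static final long[] BASELINE_FOR = {
        1529467, 91, 133848, 466022, 683006, 990644, 1305537, 1867660,
        2211136, 2405504, 2178356, 1919968, 1701251, 1426872, 1101624, 806380,
        583235, 407402, 285690, 172866, 59108, 26582, 17481, 7676
    };

    // Counts copied from the PFOR histogram; rows 22..31 are all zero and are omitted.
    public static final long[] CANDIDATE_PFOR = {
        1573609, 139054, 471777, 824652, 1102450, 1266069, 1772639, 2197883,
        2211276, 2061395, 1902012, 1710722, 1427739, 1117073, 866890, 627122,
        444684, 308501, 203542, 52612, 5689, 16
    };

    // Sum of all histogram counts.
    public static long totalCount(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        return total;
    }

    // Sum of (bits per value * count) over all rows.
    public static long totalBits(long[] counts) {
        long bits = 0;
        for (int bpv = 0; bpv < counts.length; bpv++) {
            bits += (long) bpv * counts[bpv];
        }
        return bits;
    }

    // Assumed formula: total bits divided by 8, truncated.
    public static long totalBytes(long[] counts) {
        return totalBits(counts) / 8;
    }

    // Count-weighted average bits per value.
    public static double meanBpv(long[] counts) {
        return (double) totalBits(counts) / totalCount(counts);
    }

    // Relative size reduction of the candidate versus the baseline, in percent.
    public static double reductionPct(long[] baseline, long[] candidate) {
        long base = totalBytes(baseline);
        return 100.0 * (base - totalBytes(candidate)) / base;
    }

    public static void main(String[] args) {
        System.out.printf("FOR:  %d bytes, mean BPV %.2f%n",
            totalBytes(BASELINE_FOR), meanBpv(BASELINE_FOR));
        System.out.printf("PFOR: %d bytes, mean BPV %.2f%n",
            totalBytes(CANDIDATE_PFOR), meanBpv(CANDIDATE_PFOR));
        System.out.printf("Reduction: %.1f pct%n",
            reductionPct(BASELINE_FOR, CANDIDATE_PFOR));
    }
}
```

Under that assumption the sketch reproduces both reported totals (25746066 and 23091702 bytes), which works out to a reduction of roughly 10.3%. As the comment already notes, this still ignores the extra bytes PFOR spends on exceptions, so the real saving in the doc ID blocks would be somewhat smaller.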