[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308723#comment-17308723
 ] 

Greg Miller commented on LUCENE-9850:
-------------------------------------

Also, here's the direct impact on bits-per-value (BPV) on the EN Wikipedia 
index generated from the benchmark "-source wikimediumall" task. It looks like 
a ~10% reduction in the doc ID blocks (25746066 bytes with FOR vs. 23091702 
with PFOR). This is a little naive, since it doesn't account for the extra 
storage PFOR needs for the exceptions, but it helps illustrate the difference 
that's producing the 2% index size reduction.
{code:java}
BASELINE (FOR)

DOC ID BPV
 0 ****                                                [6.86 pct] (1529467 of 22287406)
 1 *                                                   [0.00 pct] (91 of 22287406)
 2 *                                                   [0.60 pct] (133848 of 22287406)
 3 **                                                  [2.09 pct] (466022 of 22287406)
 4 **                                                  [3.06 pct] (683006 of 22287406)
 5 ***                                                 [4.44 pct] (990644 of 22287406)
 6 ***                                                 [5.86 pct] (1305537 of 22287406)
 7 *****                                               [8.38 pct] (1867660 of 22287406)
 8 *****                                               [9.92 pct] (2211136 of 22287406)
 9 ******                                              [10.79 pct] (2405504 of 22287406)
10 *****                                               [9.77 pct] (2178356 of 22287406)
11 *****                                               [8.61 pct] (1919968 of 22287406)
12 ****                                                [7.63 pct] (1701251 of 22287406)
13 ****                                                [6.40 pct] (1426872 of 22287406)
14 ***                                                 [4.94 pct] (1101624 of 22287406)
15 **                                                  [3.62 pct] (806380 of 22287406)
16 **                                                  [2.62 pct] (583235 of 22287406)
17 *                                                   [1.83 pct] (407402 of 22287406)
18 *                                                   [1.28 pct] (285690 of 22287406)
19 *                                                   [0.78 pct] (172866 of 22287406)
20 *                                                   [0.27 pct] (59108 of 22287406)
21 *                                                   [0.12 pct] (26582 of 22287406)
22 *                                                   [0.08 pct] (17481 of 22287406)
23 *                                                   [0.03 pct] (7676 of 22287406)
24                                                     [0.00 pct] (0 of 22287406)
25                                                     [0.00 pct] (0 of 22287406)
26                                                     [0.00 pct] (0 of 22287406)
27                                                     [0.00 pct] (0 of 22287406)
28                                                     [0.00 pct] (0 of 22287406)
29                                                     [0.00 pct] (0 of 22287406)
30                                                     [0.00 pct] (0 of 22287406)
31                                                     [0.00 pct] (0 of 22287406)
Total bytes used: 25746066


CANDIDATE (PFOR)

DOC ID BPV
 0 ****                                                [7.06 pct] (1573609 of 22287406)
 1 *                                                   [0.62 pct] (139054 of 22287406)
 2 **                                                  [2.12 pct] (471777 of 22287406)
 3 **                                                  [3.70 pct] (824652 of 22287406)
 4 ***                                                 [4.95 pct] (1102450 of 22287406)
 5 ***                                                 [5.68 pct] (1266069 of 22287406)
 6 ****                                                [7.95 pct] (1772639 of 22287406)
 7 *****                                               [9.86 pct] (2197883 of 22287406)
 8 *****                                               [9.92 pct] (2211276 of 22287406)
 9 *****                                               [9.25 pct] (2061395 of 22287406)
10 *****                                               [8.53 pct] (1902012 of 22287406)
11 ****                                                [7.68 pct] (1710722 of 22287406)
12 ****                                                [6.41 pct] (1427739 of 22287406)
13 ***                                                 [5.01 pct] (1117073 of 22287406)
14 **                                                  [3.89 pct] (866890 of 22287406)
15 **                                                  [2.81 pct] (627122 of 22287406)
16 *                                                   [2.00 pct] (444684 of 22287406)
17 *                                                   [1.38 pct] (308501 of 22287406)
18 *                                                   [0.91 pct] (203542 of 22287406)
19 *                                                   [0.24 pct] (52612 of 22287406)
20 *                                                   [0.03 pct] (5689 of 22287406)
21 *                                                   [0.00 pct] (16 of 22287406)
22                                                     [0.00 pct] (0 of 22287406)
23                                                     [0.00 pct] (0 of 22287406)
24                                                     [0.00 pct] (0 of 22287406)
25                                                     [0.00 pct] (0 of 22287406)
26                                                     [0.00 pct] (0 of 22287406)
27                                                     [0.00 pct] (0 of 22287406)
28                                                     [0.00 pct] (0 of 22287406)
29                                                     [0.00 pct] (0 of 22287406)
30                                                     [0.00 pct] (0 of 22287406)
31                                                     [0.00 pct] (0 of 22287406)
Total bytes used: 23091702
{code}
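
For anyone who wants to re-derive the totals: the two "Total bytes used" 
figures appear to be consistent with summing count x BPV over each histogram 
and dividing by 8 (truncating). That is an observation from the numbers above, 
not a statement about how the benchmark tool computes them. A minimal sketch 
of the arithmetic follows; the class and method names are made up for 
illustration.

```java
import java.util.Locale;

class BpvTotals {
    // Counts copied from the histograms above, indexed by bits-per-value.
    // Trailing all-zero rows are omitted.
    static final long[] BASELINE_FOR = {
        1529467, 91, 133848, 466022, 683006, 990644, 1305537, 1867660,
        2211136, 2405504, 2178356, 1919968, 1701251, 1426872, 1101624, 806380,
        583235, 407402, 285690, 172866, 59108, 26582, 17481, 7676
    };
    static final long[] CANDIDATE_PFOR = {
        1573609, 139054, 471777, 824652, 1102450, 1266069, 1772639, 2197883,
        2211276, 2061395, 1902012, 1710722, 1427739, 1117073, 866890, 627122,
        444684, 308501, 203542, 52612, 5689, 16
    };

    /** Sums count * bpv (in bits) over the histogram and converts to whole bytes. */
    static long totalBytes(long[] countsByBpv) {
        long bits = 0;
        for (int bpv = 0; bpv < countsByBpv.length; bpv++) {
            bits += countsByBpv[bpv] * bpv;
        }
        return bits / 8; // integer division truncates any partial byte
    }

    /** Relative size reduction, in percent, going from baseline to candidate. */
    static double reductionPct(long baselineBytes, long candidateBytes) {
        return 100.0 * (baselineBytes - candidateBytes) / baselineBytes;
    }

    public static void main(String[] args) {
        long forBytes = totalBytes(BASELINE_FOR);
        long pforBytes = totalBytes(CANDIDATE_PFOR);
        System.out.println(String.format(Locale.ROOT,
            "FOR=%d PFOR=%d reduction=%.1f%%",
            forBytes, pforBytes, reductionPct(forBytes, pforBytes)));
    }
}
```

With the counts above this reproduces 25746066 and 23091702 bytes, i.e. a 
reduction of roughly 10.3%. As noted, PFOR exception storage is not included.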

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
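
To make the size side of that trade-off concrete, here is a simplified sketch 
of how the per-block bit width differs between FOR and PFOR. It is not 
Lucene's implementation: the class name, the limit of 7 exceptions per block, 
the 2 bytes charged per exception and the 8-bit cap on exception overflow are 
all assumptions of the sketch, and per-block header bytes are ignored.

```java
import java.util.Arrays;

class PforSketch {
    // Assumptions of this sketch, not taken from the issue text:
    static final int MAX_EXCEPTIONS = 7;      // exceptions allowed per block
    static final int BYTES_PER_EXCEPTION = 2; // one index byte + one byte of high bits
    static final int MAX_OVERFLOW_BITS = 8;   // high bits a single exception may carry

    /** Bits needed to represent a non-negative value (0 for the value 0). */
    static int bitsRequired(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    /** FOR: the whole block is packed at the width of its largest delta. */
    static int forBits(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    /** PFOR: a width that covers all but at most MAX_EXCEPTIONS of the largest deltas. */
    static int pforBits(long[] deltas) {
        int n = deltas.length;
        if (n == 0) {
            return 0;
        }
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        long largestUnpatched = n > MAX_EXCEPTIONS ? sorted[n - 1 - MAX_EXCEPTIONS] : 0;
        int bits = bitsRequired(largestUnpatched);
        // An exception can only carry MAX_OVERFLOW_BITS extra bits, so widen if needed.
        return Math.max(bits, bitsRequired(sorted[n - 1]) - MAX_OVERFLOW_BITS);
    }

    /** Number of deltas that do not fit in the given width. */
    static int countExceptions(long[] deltas, int bits) {
        int count = 0;
        for (long d : deltas) {
            if (bitsRequired(d) > bits) {
                count++;
            }
        }
        return count;
    }

    static long forBytes(long[] deltas) {
        return ((long) deltas.length * forBits(deltas) + 7) / 8;
    }

    static long pforBytes(long[] deltas) {
        int bits = pforBits(deltas);
        return ((long) deltas.length * bits + 7) / 8
            + (long) countExceptions(deltas, bits) * BYTES_PER_EXCEPTION;
    }

    public static void main(String[] args) {
        long[] deltas = new long[128];
        Arrays.fill(deltas, 5); // most deltas are small ...
        deltas[17] = 1000;      // ... but one outlier forces FOR to a wide block
        System.out.println("FOR:  " + forBits(deltas) + " bpv, " + forBytes(deltas) + " bytes");
        System.out.println("PFOR: " + pforBits(deltas) + " bpv, " + pforBytes(deltas) + " bytes");
    }
}
```

In this toy block a single large delta pushes FOR to 10 bits per value (160 
bytes), while the sketch's PFOR stays at 3 bits per value plus one exception 
(50 bytes). The decode-side cost of patching exceptions before applying the 
deltas is what the performance measurement would need to capture.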



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
