[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Michael McCandless (Jira) Tue, 23 Mar 2021 09:33:04 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307210#comment-17307210
 ]


Michael McCandless commented on LUCENE-9850:
--------------------------------------------

Here's the result of {{bpv-tool-only}} on the Lucene nightly benchmarks (EN Wikipedia) index:
{noformat}
DOC ID BPV
 0 ****                                                [7.93 pct] (1466075 of 18484892)
 1 *                                                   [0.00 pct] (165 of 18484892)
 2 *                                                   [0.58 pct] (106653 of 18484892)
 3 **                                                  [2.17 pct] (400444 of 18484892)
 4 **                                                  [3.16 pct] (584748 of 18484892)
 5 ***                                                 [4.51 pct] (833082 of 18484892)
 6 ***                                                 [5.86 pct] (1082974 of 18484892)
 7 *****                                               [8.45 pct] (1561144 of 18484892)
 8 *****                                               [9.93 pct] (1835188 of 18484892)
 9 ******                                              [10.66 pct] (1970466 of 18484892)
10 *****                                               [9.68 pct] (1788853 of 18484892)
11 *****                                               [8.62 pct] (1594306 of 18484892)
12 ****                                                [7.62 pct] (1409009 of 18484892)
13 ****                                                [6.23 pct] (1151456 of 18484892)
14 ***                                                 [4.72 pct] (872013 of 18484892)
15 **                                                  [3.46 pct] (640401 of 18484892)
16 **                                                  [2.52 pct] (466228 of 18484892)
17 *                                                   [1.73 pct] (320292 of 18484892)
18 *                                                   [1.19 pct] (220389 of 18484892)
19 *                                                   [0.62 pct] (114238 of 18484892)
20 *                                                   [0.21 pct] (38229 of 18484892)
21 *                                                   [0.09 pct] (16846 of 18484892)
22 *                                                   [0.05 pct] (9250 of 18484892)
23 *                                                   [0.01 pct] (2443 of 18484892)
24                                                     [0.00 pct] (0 of 18484892)
25                                                     [0.00 pct] (0 of 18484892)
26                                                     [0.00 pct] (0 of 18484892)
27                                                     [0.00 pct] (0 of 18484892)
28                                                     [0.00 pct] (0 of 18484892)
29                                                     [0.00 pct] (0 of 18484892)
30                                                     [0.00 pct] (0 of 18484892)
31                                                     [0.00 pct] (0 of 18484892)
Total bytes used: 20912256 {noformat}
Curious how many 0-bit cases there are!
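
For anyone who wants to reproduce this kind of tally outside Lucene, here is a rough Python sketch of how a bits-per-value (bpv) histogram over FOR blocks could be computed. It is not the {{bpv-tool-only}} code. The names {{BLOCK_SIZE}}, {{for_block_bpv}} and {{bpv_histogram}} are made up for illustration, the block size of 128 is an assumption, and so is the rule that a block whose deltas are all 1 counts as the 0-bit case.

```python
from collections import Counter

BLOCK_SIZE = 128  # assumed block size for this sketch


def for_block_bpv(deltas):
    """Bits per value a plain FOR encoding would need for one block of deltas."""
    # Assumption: a fully dense block (every delta is 1) is the 0-bit case.
    if all(d == 1 for d in deltas):
        return 0
    # Otherwise every value in the block is packed with the width of the largest.
    return max(d.bit_length() for d in deltas)


def bpv_histogram(deltas, block_size=BLOCK_SIZE):
    """Count how many full blocks fall into each bits-per-value bucket."""
    hist = Counter()
    # Only full blocks are considered; a trailing partial block is ignored here.
    for start in range(0, len(deltas) - block_size + 1, block_size):
        hist[for_block_bpv(deltas[start:start + block_size])] += 1
    return hist


if __name__ == "__main__":
    # One dense block, plus one block where a single large delta sets the width.
    sample = [1] * 128 + [3] * 127 + [1000]
    for bpv, count in sorted(bpv_histogram(sample).items()):
        print(f"{bpv:2d} {count}")
```

In the second sample block a single delta of 1000 forces all 128 values to 10 bits, even though the other 127 would fit in 2 bits.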

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
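
To make the size side of that trade-off concrete, here is a small Python sketch comparing the bytes a block might take under plain FOR versus a patched (PFOR-style) encoding. It is a simplified model and not Lucene's on-disk format. The helper names, the cap of 7 exceptions per block, the 2 bytes charged per exception, and the 1-byte header are all assumptions chosen for illustration.

```python
MAX_EXCEPTIONS = 7        # assumed cap on patched values per block
BYTES_PER_EXCEPTION = 2   # assumed cost: one byte for position, one for high bits


def for_bytes(deltas):
    """Approximate size of a block under plain FOR: 1 header byte + packed values."""
    bpv = max(d.bit_length() for d in deltas)
    return 1 + (len(deltas) * bpv + 7) // 8


def pfor_bytes(deltas):
    """Approximate size under a patched encoding (needs more than MAX_EXCEPTIONS values)."""
    ordered = sorted(deltas, reverse=True)
    max_bits = ordered[0].bit_length()
    # Pick a width that covers all but the MAX_EXCEPTIONS largest values, while
    # keeping each exception's leftover high bits within a single byte.
    patched_bits = max(ordered[MAX_EXCEPTIONS].bit_length(), max_bits - 8)
    num_exceptions = sum(1 for d in deltas if d.bit_length() > patched_bits)
    packed = (len(deltas) * patched_bits + 7) // 8
    return 1 + packed + BYTES_PER_EXCEPTION * num_exceptions


if __name__ == "__main__":
    block = [3] * 127 + [1000]  # one outlier among small deltas
    print("FOR bytes: ", for_bytes(block))
    print("PFOR bytes:", pfor_bytes(block))
```

This model says nothing about decode speed, which is the other half of the question raised above: the fast path for summing deltas during decode is the cost that any size win from patching would have to be weighed against.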



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

