[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Greg Miller (Jira) Thu, 18 Mar 2021 05:54:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304121#comment-17304121
 ]


Greg Miller commented on LUCENE-9850:
-------------------------------------

I'm still working on putting the tool I've hacked together somewhere 
shareable, but here are some interesting bits-per-value (BPV) results for FOR 
vs. PFOR on a real index used in our production system (Amazon product search). 
In this example, using PFOR instead of FOR decreased the storage used for 
encoding doc IDs by *~8%*. The histograms below show what percentage of doc ID 
blocks used each bits-per-value encoding.
{code:java}
Vanilla FOR Doc ID BPV
 0 *                                                   [0.45 pct] 
 1 *                                                   [0.00 pct] 
 2 *                                                   [1.17 pct] 
 3 *                                                   [1.28 pct] 
 4 ***                                                 [4.36 pct] 
 5 *****                                               [9.40 pct] 
 6 ****                                                [6.96 pct] 
 7 *****                                               [9.68 pct] 
 8 *******                                             [12.40 pct] 
 9 ********                                            [14.55 pct] 
10 ******                                              [11.78 pct] 
11 *****                                               [8.94 pct] 
12 ****                                                [6.61 pct] 
13 ***                                                 [4.37 pct] 
14 **                                                  [2.86 pct] 
15 *                                                   [1.82 pct] 
16 *                                                   [1.32 pct] 
17 *                                                   [1.06 pct] 
18 *                                                   [0.59 pct] 
19 *                                                   [0.27 pct] 
20 *                                                   [0.11 pct] 
21 *                                                   [0.02 pct] 
22 *                                                   [0.00 pct] 

PFOR Doc ID BPV
 0 *                                                   [0.59 pct] 
 1 *                                                   [0.21 pct] 
 2 *                                                   [1.74 pct] 
 3 **                                                  [2.08 pct] 
 4 *****                                               [9.94 pct] 
 5 ****                                                [7.58 pct] 
 6 *****                                               [8.11 pct] 
 7 *******                                             [12.30 pct] 
 8 *******                                             [12.92 pct] 
 9 *******                                             [12.28 pct] 
10 *****                                               [9.92 pct] 
11 ****                                                [7.52 pct] 
12 ***                                                 [5.30 pct] 
13 **                                                  [3.52 pct] 
14 **                                                  [2.36 pct] 
15 *                                                   [1.66 pct] 
16 *                                                   [1.10 pct] 
17 *                                                   [0.74 pct] 
18 *                                                   [0.13 pct] 
{code}
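As a rough cross-check on the ~8% figure, the two histograms above can be reduced to a block-weighted mean bits-per-value. The sketch below is not the tool mentioned above; the class and method names are made up for illustration, the percentages are copied from the histograms, and it assumes every block holds the same number of doc ID deltas and ignores per-block headers and any bytes PFOR spends on exceptions.

```java
final class BpvHistogram {
    // Percent of doc ID blocks per bits-per-value, copied from the histograms.
    // Array index == bits per value.
    static final double[] FOR_PCT = {
        0.45, 0.00, 1.17, 1.28, 4.36, 9.40, 6.96, 9.68, 12.40, 14.55, 11.78, 8.94,
        6.61, 4.37, 2.86, 1.82, 1.32, 1.06, 0.59, 0.27, 0.11, 0.02, 0.00
    };
    static final double[] PFOR_PCT = {
        0.59, 0.21, 1.74, 2.08, 9.94, 7.58, 8.11, 12.30, 12.92, 12.28, 9.92, 7.52,
        5.30, 3.52, 2.36, 1.66, 1.10, 0.74, 0.13
    };

    // Block-weighted mean bits per value (assumes equally sized blocks).
    static double meanBpv(double[] pct) {
        double weighted = 0.0;
        double total = 0.0;
        for (int bpv = 0; bpv < pct.length; bpv++) {
            weighted += bpv * pct[bpv];
            total += pct[bpv];
        }
        return weighted / total;
    }

    // Relative reduction in mean bpv when moving from FOR to PFOR.
    static double relativeReduction() {
        double forMean = meanBpv(FOR_PCT);
        return (forMean - meanBpv(PFOR_PCT)) / forMean;
    }

    public static void main(String[] args) {
        System.out.printf("FOR  mean bpv: %.2f%n", meanBpv(FOR_PCT));
        System.out.printf("PFOR mean bpv: %.2f%n", meanBpv(PFOR_PCT));
        System.out.printf("reduction:     %.1f%%%n", 100.0 * relativeReduction());
    }
}
```

Under those assumptions the relative reduction comes out a little under 8%, in line with the figure quoted above.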
I'll also mention that when running an internal benchmarking tool, moving from 
FOR to PFOR actually improved our red-line queries/sec by 1%, which is somewhat 
counter-intuitive to me. I would have expected the additional cost of applying 
PFOR exceptions to hurt QPS, but maybe it's being offset by some benefit from 
the smaller bpv? It would be interesting to repeat this index size and 
performance benchmarking on some other indexes to see if the results hold. 
I'll see if I can figure out how to do this.
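
The trade-off described above (smaller bpv versus the extra work of applying exceptions) can be illustrated on a single block. This is a simplified sketch, not Lucene's actual encoder: the class name, the limit of 7 exceptions per block, the one-byte cap on each exception's high bits, and the 2-bytes-per-exception cost are all assumptions chosen for illustration. Deltas are assumed to be non-negative.

```java
import java.util.Arrays;

final class PforSketch {
    // Assumption for this sketch: at most 7 patched values per block.
    static final int MAX_EXCEPTIONS = 7;

    // Number of bits needed to represent a non-negative value.
    static int bitsRequired(long v) {
        return v == 0 ? 0 : 64 - Long.numberOfLeadingZeros(v);
    }

    // FOR: every value in the block is packed at the width of the largest one.
    static int forBits(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    // PFOR (simplified): pack at a width that fits all but the largest
    // MAX_EXCEPTIONS values, while keeping each exception's high bits
    // within 8 bits (assumption: one byte per exception for the high bits).
    static int pforBits(long[] deltas) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int maxBits = bitsRequired(sorted[sorted.length - 1]);
        int idx = Math.max(0, sorted.length - 1 - MAX_EXCEPTIONS);
        return Math.max(bitsRequired(sorted[idx]), maxBits - 8);
    }

    // Values that do not fit in `bits` and would need to be patched.
    static int countExceptions(long[] deltas, int bits) {
        int n = 0;
        for (long d : deltas) {
            if (bitsRequired(d) > bits) {
                n++;
            }
        }
        return n;
    }

    // Packed payload size in bytes, assuming 2 bytes per exception
    // (position + high bits) and ignoring any block header.
    static int payloadBytes(int numValues, int bits, int exceptions) {
        return (numValues * bits + 7) / 8 + 2 * exceptions;
    }
}
```

For a 128-value block where 125 deltas fit in 4 bits and 3 outliers need 10 bits, this sketch packs at 10 bpv under FOR (160 bytes) and at 4 bpv plus 3 exceptions under PFOR (64 + 6 = 70 bytes), before any header bytes. Decoding the PFOR block then has to patch those 3 values back in, which is the extra cost weighed against the smaller payload.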

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible from 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
