[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304608#comment-17304608
 ] 

Greg Miller commented on LUCENE-9850:
-------------------------------------

Alright, here's the tool I threw together to generate the bits-per-value 
histograms for encoded doc IDs. There are two branches to be aware of:

1. [bpv-tool-only|https://github.com/gsmiller/lucene/tree/LUCENE-9850/bpv-tool-only]:
 The tool lives in o.a.l.codecs.BitsPerValueTool.java and takes the index path 
as its single argument. It assumes doc IDs in the index use the standard FOR 
(delta) encoding and generates bits-per-value stats and a histogram to serve 
as a baseline, as shown above.
2. [pfordocids-and-bpv-tool|https://github.com/gsmiller/lucene/tree/LUCENE-9850/pfordocids-and-bpv-tool]:
 This branch uses PFOR instead of FOR for encoding doc ID deltas. It exposes 
the same tool as bpv-tool-only, and the tool is run in the same way. Note that 
the tool in bpv-tool-only _will not_ work on an index created with this branch 
(and vice versa), since that tool assumes FOR whereas this branch assumes PFOR.
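
As a rough illustration of why the two branches produce different 
bits-per-value numbers, here is a minimal sketch of how a single block of doc 
ID deltas might be sized under each scheme. This is not the code from either 
branch. The class name BpvSketch, the limit of 7 exceptions per block, and the 
rule that an exception's high bits must fit in 8 bits are all assumptions made 
for the example.

```java
import java.util.Arrays;

final class BpvSketch {
    // Assumed limit on patched values per block (illustrative, not taken from the branches).
    static final int MAX_EXCEPTIONS = 7;

    // Number of bits needed to represent a non-negative value (0 for the value 0).
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    // FOR: every value in the block is stored with the width of the largest value.
    static int forBitsPerValue(long[] deltas) {
        long or = 0;
        for (long d : deltas) {
            or |= d; // the OR of all values has the same bit width as the maximum
        }
        return bitsRequired(or);
    }

    // PFOR (sketch): pick a width that covers all but the MAX_EXCEPTIONS largest
    // values; the remaining values are patched in separately as exceptions.
    static int pforBitsPerValue(long[] deltas) {
        int[] bits = new int[deltas.length];
        for (int i = 0; i < deltas.length; i++) {
            bits[i] = bitsRequired(deltas[i]);
        }
        Arrays.sort(bits);
        int maxBits = bits[bits.length - 1];
        int idx = Math.max(0, bits.length - 1 - MAX_EXCEPTIONS);
        // Assumption: the high bits of an exception are stored in one byte,
        // so the chosen width can be at most 8 bits below the maximum.
        return Math.max(bits[idx], maxBits - 8);
    }

    public static void main(String[] args) {
        long[] block = new long[128];
        Arrays.fill(block, 3L); // mostly small deltas
        block[5] = 1000L;       // a single outlier
        System.out.println("FOR bpv:  " + forBitsPerValue(block));  // prints 10
        System.out.println("PFOR bpv: " + pforBitsPerValue(block)); // prints 2
    }
}
```

A single large delta forces FOR to widen the whole block, while the PFOR 
sketch keeps the narrow width and treats the outlier as an exception. The 
extra bytes spent on exceptions are not counted in this bits-per-value figure.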

If anyone is interested in looking at the bits-per-value reduction in their 
own index from using PFOR instead of FOR, you'll want to first run the tool in 
bpv-tool-only against your existing index to establish a baseline, then rebuild 
your index using the pfordocids-and-bpv-tool branch, and finally run the tool 
in pfordocids-and-bpv-tool to see the new index stats. Note that you shouldn't 
use either of these branches for any performance testing, since they've both 
been hacked up to maintain some histogram stats in the FOR/PFOR code, which 
wouldn't be there in a real application. Hope that makes sense. Next I'll see 
if I can figure out a good way to measure some "standard" indexes and also run 
some performance benchmarks.
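
Once both runs are done, the two histograms can be reduced to a single number 
for comparison. The following is a minimal sketch under assumptions: the 
histogram is taken to be an array where index b holds the number of blocks 
encoded with b bits per value, which may not match the tool's actual output 
format, and the class and method names are hypothetical.

```java
final class BpvHistogramCompare {
    // histogram[b] = number of blocks encoded with b bits per value (assumed layout).
    static double meanBitsPerValue(long[] histogram) {
        long blocks = 0;
        long weighted = 0;
        for (int b = 0; b < histogram.length; b++) {
            blocks += histogram[b];
            weighted += (long) b * histogram[b];
        }
        return blocks == 0 ? 0.0 : (double) weighted / blocks;
    }

    // Percentage drop in mean bits per value going from the baseline (FOR)
    // histogram to the candidate (PFOR) histogram.
    static double reductionPercent(long[] baseline, long[] candidate) {
        double before = meanBitsPerValue(baseline);
        double after = meanBitsPerValue(candidate);
        return before == 0.0 ? 0.0 : 100.0 * (before - after) / before;
    }
}
```

This only compares the packed width per value. It ignores the extra bytes 
PFOR spends on exceptions, so the real index size reduction would be somewhat 
smaller than the figure it reports.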

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
