[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307226#comment-17307226 ]
Greg Miller commented on LUCENE-9850:
-------------------------------------

{quote}I wonder if it would help to make the encoding/decoding aware that not all numbers of bits per value are equal. For instance the benchmarks ([https://github.com/jpountz/decode-128-ints-benchmark]) I ran when looking into vectorizing decoding suggested that throughputs were highly dependent on the number of bits per value. So maybe we could tune PFOR to never e.g. go from 16 bits per value to 15 because the savings are small while the decoding is significantly slower.
{quote}
Yeah, interesting thought! I experimented with a similar idea to always round up to powers of 2 bpv (i.e., 1, 2, 4, 8, 16) since the code for decoding those bpv's appears much simpler and more optimized. I wasn't aware of your benchmark results at the time I tried this, but it seems to generally align with your findings (with maybe a couple exceptions).

This increased our red-line queries/sec by +2.3% but came at the cost of +9.6% index size (yikes)! Most of the index size growth was coming from rounding up everything larger than 8 but less than 16. When I capped the rounding to 8, the index only grew by +1% but red-line queries/sec improvements were only +0.7%.

I think you're right though, in that there's probably some interesting work to not round everything up, but be more precise with the bpv's we try to avoid.

{quote}Also maybe the PFOR decoding logic could still optimize the prefix sum in the case when there are no exceptions?
{quote}
This is a great suggestion! I'll add that logic.

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
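
For illustration only, the power-of-2 rounding experiment mentioned in the comment could be sketched roughly as below. This is not the patch that was benchmarked: the class BpvRounding, the methods roundUpBpv and extraBitsPerBlock, and the maxRounded parameter are all hypothetical names, and treating the "cap" as "leave a bpv unchanged when its rounded value would exceed the cap" is an assumption about how the capping worked. The block size of 128 in the usage note is likewise assumed.

```java
final class BpvRounding {
  private BpvRounding() {}

  /**
   * Rounds bitsPerValue up to the next power of two, but only when the
   * rounded value does not exceed maxRounded (hypothetical cap parameter).
   * Otherwise the original bitsPerValue is returned unchanged.
   */
  public static int roundUpBpv(int bitsPerValue, int maxRounded) {
    if (bitsPerValue <= 0) {
      throw new IllegalArgumentException("bitsPerValue must be positive: " + bitsPerValue);
    }
    // Largest power of two that is <= bitsPerValue.
    int floor = Integer.highestOneBit(bitsPerValue);
    // Already a power of two: keep it. Otherwise move to the next one up.
    int rounded = (floor == bitsPerValue) ? bitsPerValue : floor << 1;
    return rounded <= maxRounded ? rounded : bitsPerValue;
  }

  /** Extra bits spent on one block of blockSize values because of rounding. */
  public static int extraBitsPerBlock(int bitsPerValue, int maxRounded, int blockSize) {
    return (roundUpBpv(bitsPerValue, maxRounded) - bitsPerValue) * blockSize;
  }
}
```

With a cap of 16, a block at 9 bpv would be stored at 16 bpv, costing extraBitsPerBlock(9, 16, 128) = 896 extra bits per 128-value block; with a cap of 8 the same block stays at 9 bpv and costs nothing extra. That matches the direction of the size difference reported above, though the actual numbers depend on the real bpv distribution in the index.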