[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

Greg Miller (Jira) Fri, 27 Aug 2021 10:41:31 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405944#comment-17405944
 ]


Greg Miller commented on LUCENE-10033:
--------------------------------------

[~weizijun] thanks for providing the updated results! Keeping in mind that 
there's a diverse set of use-cases for Lucene, and many consider performance to 
be critical (e.g., QPS, latency), I wouldn't be in favor of a change to the 
default codec that results in a ~50% regression, even if it does show better 
index size compression. I'm not sure if we have a repository of "alternative" 
type codecs that you might consider contributing to (i.e., creating or 
contributing to an alternative codec that heavily favors index size over 
decoding performance), but [~jpountz] would know better and could probably 
offer some advice there. Thanks again for experimenting with this though! 
Interesting to see the results!

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: benchmark, benchmark-10m
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
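
To illustrate the outlier point in the description above, here is a rough sketch (not Lucene code) that compares the packed size of 16384 values when a single bits-per-value width is chosen per block of 16384 versus per block of 128. The class and method names are hypothetical. It assumes non-negative values and ignores any per-block header or skip data, which would add overhead to the smaller-block layout.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not how DirectWriter is implemented.
final class BlockBitsSketch {

    // Bits needed to store the largest value in values[from, to), at least 1.
    // Assumes all values are non-negative.
    public static int bitsRequired(long[] values, int from, int to) {
        long max = 0;
        for (int i = from; i < to; i++) {
            max = Math.max(max, values[i]);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    // Total payload bits when each block of blockSize values uses one
    // fixed width, chosen from the largest value in that block.
    public static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            total += (long) bitsRequired(values, start, end) * (end - start);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[16384];
        Arrays.fill(values, 3L);      // every value fits in 2 bits
        values[5000] = 1L << 40;      // one outlier that needs 41 bits

        System.out.println(totalBits(values, 16384)); // 671744
        System.out.println(totalBits(values, 128));   // 37760
    }
}
```

With one 16k block, the outlier forces all 16384 values to 41 bits each. With 128-value blocks, only the block containing the outlier pays that cost. This says nothing about decoding speed, which is the concern raised in the comment above.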



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
