[jira] [Created] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

Adrien Grand (Jira) Thu, 22 Jul 2021 10:32:05 -0700

Adrien Grand created LUCENE-10033:
-------------------------------------

             Summary: Encode doc values in smaller blocks of values, like postings
                 Key: LUCENE-10033
                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand



This is a follow-up to the discussion on this thread: 
https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.

Our current approach for doc values uses large blocks of 16k values where 
values can be decompressed independently, using DirectWriter/DirectReader. This 
is inefficient in some cases: a single outlier can grow the number of bits per 
value for the entire block, and we can't easily use run-length compression. It 
also encourages using a different sub-class for every compression technique, 
which puts pressure on the JVM.
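
To make the outlier cost concrete, here is a rough sketch, not Lucene code. 
The class and method names (BlockBitsSketch, bitsRequired, packedBits, 
sampleWithOutlier) and the sample data are made up for illustration. It assumes 
non-negative values that are plainly bit-packed at one width per block, and it 
ignores per-block header overhead.

```java
final class BlockBitsSketch {
  /** Bits needed to store the non-negative value v, never less than 1. */
  static int bitsRequired(long v) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
  }

  /** Payload bits when values are packed in blocks of blockSize, one bit width per block. */
  static long packedBits(long[] values, int blockSize) {
    long total = 0;
    for (int start = 0; start < values.length; start += blockSize) {
      int end = Math.min(start + blockSize, values.length);
      long max = 0;
      for (int i = start; i < end; i++) {
        max = Math.max(max, values[i]);
      }
      // Every value in the block pays for the widest value in that block.
      total += (long) bitsRequired(max) * (end - start);
    }
    return total;
  }

  /** Hypothetical data: 16384 values in [0, 15] plus a single large outlier. */
  static long[] sampleWithOutlier() {
    long[] values = new long[16384];
    for (int i = 0; i < values.length; i++) {
      values[i] = i & 15;
    }
    values[5000] = 1L << 40;
    return values;
  }

  public static void main(String[] args) {
    long[] values = sampleWithOutlier();
    System.out.println("one 16k block:    " + packedBits(values, 16384) + " bits");
    System.out.println("128-value blocks: " + packedBits(values, 128) + " bits");
  }
}
```

With this made-up input, the single 16k block needs 41 bits for every value 
(671,744 payload bits), while 128-value blocks confine the 41-bit width to the 
one block that holds the outlier (70,272 payload bits).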

We'd like to move to an approach closer to postings: smaller blocks (e.g. 128 
values) whose values are all decompressed at once (using SIMD instructions), 
with skip data within blocks in order to efficiently skip to arbitrary doc IDs. 
Alternatively, we could keep using jump tables, as doc values do today and as 
discussed here for postings: 
https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E.
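
As a rough illustration of the block-at-a-time idea, here is a minimal sketch 
under several assumptions. It is not Lucene's implementation and uses no Lucene 
APIs: all names (BlockDecodeSketch, Block, Reader, blocksDecoded, sample) are 
hypothetical. It assumes a dense field where the doc ID is the value index, 
non-negative values, and an in-memory array of blocks standing in for a jump 
table. The decode loop is plain scalar Java, with no explicit SIMD.

```java
final class BlockDecodeSketch {
  static final int BLOCK_SIZE = 128;

  /** One encoded block: a bit width plus the values packed at that width. */
  static final class Block {
    final int bitsPerValue;
    final long[] packed;
    final int count;

    Block(int bitsPerValue, long[] packed, int count) {
      this.bitsPerValue = bitsPerValue;
      this.packed = packed;
      this.count = count;
    }
  }

  /** Packs values[from, to) at the width required by the block's largest value. */
  static Block encode(long[] values, int from, int to) {
    long max = 0;
    for (int i = from; i < to; i++) {
      max = Math.max(max, values[i]);
    }
    int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    long[] packed = new long[(int) (((long) bits * (to - from) + 63) / 64)];
    for (int i = from; i < to; i++) {
      long bitPos = (long) (i - from) * bits;
      int word = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      packed[word] |= values[i] << shift;
      if (shift + bits > 64) {
        // The value straddles two words: spill its high bits into the next one.
        packed[word + 1] |= values[i] >>> (64 - shift);
      }
    }
    return new Block(bits, packed, to - from);
  }

  /** Decodes every value of the block into out in a single pass. */
  static void decode(Block b, long[] out) {
    long mask = b.bitsPerValue == 64 ? -1L : (1L << b.bitsPerValue) - 1;
    for (int i = 0; i < b.count; i++) {
      long bitPos = (long) i * b.bitsPerValue;
      int word = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long v = b.packed[word] >>> shift;
      if (shift + b.bitsPerValue > 64) {
        v |= b.packed[word + 1] << (64 - shift);
      }
      out[i] = v & mask;
    }
  }

  /** Reads values by doc ID, decoding a whole block only when the target block changes. */
  static final class Reader {
    private final Block[] blocks; // the array index stands in for a jump table
    private final long[] buffer = new long[BLOCK_SIZE];
    private int currentBlock = -1;
    int blocksDecoded = 0;

    Reader(long[] values) {
      int n = (values.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
      blocks = new Block[n];
      for (int b = 0; b < n; b++) {
        int to = Math.min(values.length, (b + 1) * BLOCK_SIZE);
        blocks[b] = encode(values, b * BLOCK_SIZE, to);
      }
    }

    long get(int docId) {
      int block = docId / BLOCK_SIZE;
      if (block != currentBlock) {
        decode(blocks[block], buffer);
        currentBlock = block;
        blocksDecoded++;
      }
      return buffer[docId % BLOCK_SIZE];
    }
  }

  /** Hypothetical data: 1000 small values plus one large outlier. */
  static long[] sample() {
    long[] values = new long[1000];
    for (int i = 0; i < values.length; i++) {
      values[i] = (i * 31L) % 977;
    }
    values[500] = 1L << 40;
    return values;
  }

  public static void main(String[] args) {
    Reader reader = new Reader(sample());
    reader.get(10);
    reader.get(900); // jumps straight to the last block, skipping those in between
    System.out.println("blocks decoded: " + reader.blocksDecoded);
  }
}
```

In this sketch, jumping from doc 10 to doc 900 decodes only the two blocks that 
contain them. A real format would also need to handle sparse fields and store 
block offsets on disk, which this example leaves out.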



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
