Adrien Grand created LUCENE-10536:
-------------------------------------

             Summary: Doc values terms dicts should use the first term of each 
block as a dictionary
                 Key: LUCENE-10536
                 URL: https://issues.apache.org/jira/browse/LUCENE-10536
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Doc values terms dictionaries split terms into blocks of 64. The first term of
each block is written uncompressed, which is useful for binary searches. Each
of the 63 other terms is encoded as a difference from the previous term, and
the resulting suffixes are all compressed together with LZ4.
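
To make the block layout concrete, here is a minimal Python sketch of the
encoding described above. It is only an illustration and not Lucene's actual
code: the names `BLOCK_SIZE`, `encode_block` and `decode_block` are
hypothetical, the "difference with the previous term" is assumed to mean
dropping the shared prefix, and the LZ4 step over the concatenated suffixes is
left out.

```python
from typing import List, Tuple

BLOCK_SIZE = 64  # terms per block, as described in this issue


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def encode_block(terms: List[bytes]) -> Tuple[bytes, List[Tuple[int, int]], bytes]:
    """Split one block into (first term, per-term lengths, suffix bytes).

    The first term is kept as-is. Every other term is reduced to
    (shared prefix length, suffix length) relative to the previous term,
    and its suffix is appended to a single buffer. In the real format that
    buffer would then be compressed with LZ4 (omitted here).
    """
    assert 0 < len(terms) <= BLOCK_SIZE
    first = terms[0]
    lengths: List[Tuple[int, int]] = []
    suffixes = bytearray()
    prev = first
    for term in terms[1:]:
        prefix = shared_prefix_len(prev, term)
        suffix = term[prefix:]
        lengths.append((prefix, len(suffix)))
        suffixes += suffix
        prev = term
    return first, lengths, bytes(suffixes)


def decode_block(first: bytes, lengths: List[Tuple[int, int]], suffixes: bytes) -> List[bytes]:
    """Inverse of encode_block: rebuild the terms of one block."""
    terms = [first]
    pos = 0
    for prefix, suffix_len in lengths:
        term = terms[-1][:prefix] + suffixes[pos:pos + suffix_len]
        terms.append(term)
        pos += suffix_len
    return terms


terms = [b"apple", b"applesauce", b"apply", b"apricot"]
first, lengths, suffixes = encode_block(terms)
```

In this example the suffix buffer holds only the bytes that differ from the
previous term, so whatever compresses it sees the second term's suffix with
nothing before it.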

With this format, the suffix of the second term is also unlikely to benefit
from any compression, since LZ4 has no preceding data to search for duplicate
bytes other than the suffix itself. A minor improvement would be to use the
first term as a dictionary when compressing the suffixes of terms 2..64.
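
The effect of a preset dictionary can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it uses DEFLATE through
the standard `zlib` module (its `zdict` parameter) as a stand-in for LZ4, since
LZ4 is not in the Python standard library, and the helper names and sample
terms are made up for the example.

```python
import zlib
from typing import Optional


def compress_suffixes(suffixes: bytes, dictionary: Optional[bytes] = None) -> bytes:
    """Compress suffix bytes, optionally priming the compressor with a dictionary."""
    if dictionary:
        comp = zlib.compressobj(zdict=dictionary)
    else:
        comp = zlib.compressobj()
    return comp.compress(suffixes) + comp.flush()


def decompress_suffixes(data: bytes, dictionary: Optional[bytes] = None) -> bytes:
    """Inverse of compress_suffixes; the same dictionary must be supplied."""
    if dictionary:
        decomp = zlib.decompressobj(zdict=dictionary)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


# Hypothetical first term of a block, stored uncompressed.
first_term = b"host=alpha;path=/products/catalog/index.html?lang=en"
# Suffix of a hypothetical second term after dropping the shared prefix b"host=".
second_suffix = b"beta;path=/products/catalog/index.html?lang=en"

plain = compress_suffixes(second_suffix)
primed = compress_suffixes(second_suffix, dictionary=first_term)
```

Without a dictionary the compressor has only the suffix itself to look at. With
the first term as a dictionary, the long run that the suffix shares with the
first term can be encoded as a back-reference, so `primed` comes out shorter
than `plain` for this input. The decoder needs no extra data, because the first
term is already stored uncompressed in the block.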



