[ https://issues.apache.org/jira/browse/LUCENE-10536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535967#comment-17535967 ]
ASF subversion and git services commented on LUCENE-10536:
----------------------------------------------------------

Commit 8f89db8048033cdd63ee08ee9e99d64c6fa4c90d in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f89db80480 ]

LUCENE-10536: Slightly better compression of doc values' terms dictionaries. (#838)

Doc values terms dictionaries keep the first term of each block uncompressed
so that they can somewhat efficiently perform binary searches across blocks.
Suffixes of the other 63 terms are compressed together using LZ4 to leverage
redundancy across suffixes. This change improves compression a bit by using
the first (uncompressed) term of each block as a dictionary when compressing
suffixes of the 63 other terms. This helps with compressing the first few
suffixes, when there is not yet much context that can be leveraged to find
duplicates.

> Doc values terms dicts should use the first term of each block as a dictionary
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10536
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Doc values terms dictionaries split data into blocks of 64 terms, where the
> first term is written uncompressed (which is useful for binary searches), and
> the 63 other terms are encoded by taking the difference with the previous
> term and compressing all suffixes together with LZ4.
> With this format, the suffix of the second term is unlikely to benefit from
> any compression, since it has no data other than itself in which to search
> for duplicate bytes. A minor improvement we could make would consist of
> using the first term as a dictionary for suffixes of terms 2..64.
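
The idea described above can be illustrated with a rough sketch. This is not Lucene's code or on-disk format: it is Python rather than Java, it substitutes raw DEFLATE with a preset dictionary (zlib's `zdict` parameter) for LZ4 because the Python standard library has no LZ4 binding, and the names `common_prefix_len`, `encode_block` and `decode_block` are hypothetical. It only shows the general shape: keep the first term of a block uncompressed, delta-encode the remaining terms against their predecessor, and compress the concatenated suffixes with or without the first term as a dictionary.

```python
import zlib


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes shared by a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def encode_block(terms, use_dictionary=True):
    """Encode one block of sorted, non-empty terms (illustrative layout only).

    The first term stays uncompressed. Every other term is reduced to
    (shared prefix length with the previous term, suffix), and all suffixes
    are compressed together in a single stream.
    """
    first = terms[0]
    prefix_lens, suffixes = [], []
    prev = first
    for term in terms[1:]:
        p = common_prefix_len(prev, term)
        prefix_lens.append(p)
        suffixes.append(term[p:])
        prev = term
    suffix_lens = [len(s) for s in suffixes]
    # Assumption: raw DEFLATE (wbits=-15) stands in for LZ4 here.
    if use_dictionary:
        # Prime the compressor with the first term so that even the
        # earliest suffixes can reference bytes from it.
        comp = zlib.compressobj(wbits=-15, zdict=first)
    else:
        comp = zlib.compressobj(wbits=-15)
    compressed = comp.compress(b"".join(suffixes)) + comp.flush()
    return first, prefix_lens, suffix_lens, compressed


def decode_block(first, prefix_lens, suffix_lens, compressed, use_dictionary=True):
    """Rebuild the terms of a block produced by encode_block."""
    if use_dictionary:
        decomp = zlib.decompressobj(wbits=-15, zdict=first)
    else:
        decomp = zlib.decompressobj(wbits=-15)
    data = decomp.decompress(compressed) + decomp.flush()
    terms = [first]
    pos = 0
    for p, n in zip(prefix_lens, suffix_lens):
        # Each term is the previous term's prefix plus its own suffix.
        terms.append(terms[-1][:p] + data[pos:pos + n])
        pos += n
    return terms


# Made-up sample terms: they diverge on the first byte but share a tail,
# so the suffixes repeat bytes that already occur in the first term.
sample_terms = sorted(
    name.encode() + b"@example.com"
    for name in ("alice", "bob", "carol", "dave", "erin")
)
with_dict = encode_block(sample_terms, use_dictionary=True)
without_dict = encode_block(sample_terms, use_dictionary=False)
print("compressed suffix bytes with dictionary:   ", len(with_dict[3]))
print("compressed suffix bytes without dictionary:", len(without_dict[3]))
```

How much the dictionary saves depends on the data and on the compressor, so the two printed sizes should be read as an illustration of the mechanism rather than as an indication of the gains measured in Lucene.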