Mark Harwood created LUCENE-9211:
------------------------------------
Summary: Adding compression to BinaryDocValues storage
Key: LUCENE-9211
URL: https://issues.apache.org/jira/browse/LUCENE-9211
Project: Lucene - Core
Issue Type: Improvement
Components: core/codecs
Reporter: Mark Harwood
Assignee: Mark Harwood
While SortedSetDocValues can be used today to store identical values in a
compact form, this is not effective for data with many unique values.
The proposal is that BinaryDocValues should be stored in LZ4-compressed blocks,
which can dramatically reduce disk storage costs in many cases. The values for
a fixed number of documents would be stored as a single compressed blob, along
with metadata recording the offsets at which each document's original value can
be found in the uncompressed content.
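As a rough illustration of the block layout described above (not the code in
the PR), the sketch below concatenates the values of one block, compresses them
as a single blob, and keeps a list of offsets into the uncompressed bytes. The
names encode_block and read_value are hypothetical, and zlib from the Python
standard library is used only as a stand-in for LZ4.

```python
import zlib
from typing import List, Tuple


def encode_block(values: List[bytes]) -> Tuple[bytes, List[int]]:
    """Compress one block of per-document values into a single blob.

    Returns the compressed blob plus offsets into the *uncompressed*
    content; offsets[i]..offsets[i + 1] delimits the value of document i.
    """
    offsets = [0]
    for value in values:
        offsets.append(offsets[-1] + len(value))
    # Assumption: zlib stands in for LZ4, which is not in the stdlib.
    blob = zlib.compress(b"".join(values))
    return blob, offsets


def read_value(blob: bytes, offsets: List[int], index: int) -> bytes:
    """Decompress the whole block, then slice out one document's value."""
    raw = zlib.decompress(blob)
    return raw[offsets[index]:offsets[index + 1]]
```

Reading a single value requires decompressing its whole block, which is why the
number of documents per block matters for lookup cost.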
There's a trade-off here between efficient compression (more docs-per-block =
better compression) and fast retrieval times (fewer docs-per-block = faster
read access for single values). A fixed block size of 32 docs seems like it
would be a reasonable compromise for most scenarios.
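The trade-off can be explored with a small experiment. The sketch below is an
assumption-laden illustration, not a benchmark of the PR: locate and
total_compressed_size are hypothetical helpers, and zlib again stands in for
LZ4. It maps a document ID to its block and position within the block, and
sums the compressed size of all blocks for a given docs-per-block setting.

```python
import zlib
from typing import List, Tuple


def locate(doc_id: int, docs_per_block: int = 32) -> Tuple[int, int]:
    """Return (block number, index within block) for a document ID."""
    return divmod(doc_id, docs_per_block)


def total_compressed_size(values: List[bytes], docs_per_block: int) -> int:
    """Sum the compressed sizes of all blocks for a given block size."""
    total = 0
    for start in range(0, len(values), docs_per_block):
        block = b"".join(values[start:start + docs_per_block])
        # Assumption: zlib stands in for LZ4 here.
        total += len(zlib.compress(block))
    return total
```

Calling total_compressed_size on the same values with different
docs_per_block settings shows how much storage larger blocks save, while
locate shows which block has to be decompressed to read a single value.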
A PR is up for review here [https://github.com/apache/lucene-solr/pull/1234]