[
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280390#comment-17280390
]
Jaison.Bi commented on LUCENE-9663:
-----------------------------------
OK, I will create a new issue. Thanks, [~broustant]
> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Jaison.Bi
> Priority: Trivial
> Fix For: master (9.0)
>
> Time Spent: 11h
> Remaining Estimate: 0h
>
> Elasticsearch keyword fields use SortedSet DocValues. In our applications,
> “keyword” is the most frequently used field type.
> LUCENE-7081 added prefix compression for the docvalues terms dict. We can do
> better by replacing prefix compression with LZ4. In one of our applications,
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
> I ran simple tests on real application data, comparing the
> write/merge time cost and the on-disk *.dvd file size (after merging into 1
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is intended only for high-cardinality fields.
> I am running a benchmark based on luceneutil and will attach the report and
> patch once the test completes.
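
For readers unfamiliar with the prefix compression mentioned in the description, the following is a minimal sketch of the general idea: each term in a sorted list is stored as the number of leading bytes it shares with the previous term, plus the remaining suffix. The class and method names (PrefixCompressionSketch, Entry, encode, decode) and the sample terms are hypothetical and chosen for illustration only. This is not Lucene's actual on-disk format, and the LZ4 step proposed in the issue is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy prefix coder for sorted terms, not Lucene code.
public class PrefixCompressionSketch {

    /** One encoded term: bytes shared with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLength;
        final byte[] suffix;

        Entry(int prefixLength, byte[] suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes terms that are assumed to be already sorted. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (String term : sortedTerms) {
            byte[] current = term.getBytes(StandardCharsets.UTF_8);
            int limit = Math.min(previous.length, current.length);
            int shared = 0;
            // Count the leading bytes that match the previous term.
            while (shared < limit && previous[shared] == current[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(current, shared, current.length)));
            previous = current;
        }
        return out;
    }

    /** Rebuilds the original terms by reusing the prefix of the previous term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (Entry e : entries) {
            byte[] current = new byte[e.prefixLength + e.suffix.length];
            System.arraycopy(previous, 0, current, 0, e.prefixLength);
            System.arraycopy(e.suffix, 0, current, e.prefixLength, e.suffix.length);
            out.add(new String(current, StandardCharsets.UTF_8));
            previous = current;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("host-0001", "host-0002", "host-0010");
        int rawBytes = 0;
        int suffixBytes = 0;
        for (String t : terms) {
            rawBytes += t.getBytes(StandardCharsets.UTF_8).length;
        }
        for (Entry e : encode(terms)) {
            suffixBytes += e.suffix.length;
        }
        System.out.println("raw bytes: " + rawBytes + ", suffix bytes: " + suffixBytes);
    }
}
```

Prefix coding only removes redundancy at the start of adjacent terms. A general-purpose compressor such as LZ4 can also exploit repeats elsewhere in the terms, which is one plausible reason for the size reduction reported in the description.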