Jaison.Bi created LUCENE-9663:
---------------------------------

             Summary: Adding compression to terms dict from SortedSet/Sorted 
DocValues
                 Key: LUCENE-9663
                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Jaison.Bi


The Elasticsearch keyword field type uses SortedSet DocValues. In our
applications, “keyword” is the most frequently used field type.
LUCENE-7081 introduced prefix compression for the docvalues terms dict. We can
do better by replacing prefix compression with LZ4. In one of our applications,
the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
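
To illustrate why a general-purpose compressor may help here, the following is a minimal sketch of prefix compression over sorted terms. It is not Lucene's implementation: the class name, method names, and sample terms are hypothetical, and per-term length headers are ignored. It only counts how many term bytes remain once each term drops the prefix it shares with the previous term.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration only; not the Lucene terms dict writer.
public class PrefixCompressionSketch {

    /** Returns the number of leading bytes that a and b have in common. */
    public static int sharedPrefixLength(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /**
     * Sums the suffix bytes left after each term drops the prefix shared
     * with the previous term. Length headers are deliberately ignored.
     */
    public static int suffixBytes(List<String> sortedTerms) {
        byte[] previous = new byte[0];
        int total = 0;
        for (String term : sortedTerms) {
            byte[] current = term.getBytes(StandardCharsets.UTF_8);
            total += current.length - sharedPrefixLength(previous, current);
            previous = current;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample terms that share both a prefix and a suffix.
        List<String> terms = List.of(
            "host-a.example.com", "host-b.example.com", "host-c.example.com");
        int raw = terms.stream()
            .mapToInt(t -> t.getBytes(StandardCharsets.UTF_8).length)
            .sum();
        System.out.println("raw bytes:    " + raw);                // 54
        System.out.println("suffix bytes: " + suffixBytes(terms)); // 44
    }
}
```

In this made-up sample, prefix compression removes only the shared "host-" prefix, while the repeated ".example.com" tail is stored for every term. An LZ77-family compressor such as LZ4 can reference repeated byte sequences anywhere in its window, which is one plausible reason for the size reduction reported above.
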
I ran simple tests on real application data, comparing write time, merge
time, and the on-disk *.dvd file size (after merging into one segment).
|| ||Before||After||
|Write time cost (ms)|591972|618200|
|Merge time cost (ms)|270661|294663|
|*.dvd file size (GB)|1.95|1.15|
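
For reference, the relative changes implied by the table can be worked out with a few lines of Java. The class and method names are hypothetical; the input numbers are taken directly from the table.

```java
import java.util.Locale;

// Hypothetical helper; the inputs are the Before/After values from the table.
public class BenchmarkDelta {

    /** Relative change from before to after, in percent. */
    public static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Locale.ROOT keeps the decimal separator as '.' on every system.
        System.out.printf(Locale.ROOT, "write time: %+.1f%%%n",
            percentChange(591972, 618200)); // +4.4%
        System.out.printf(Locale.ROOT, "merge time: %+.1f%%%n",
            percentChange(270661, 294663)); // +8.9%
        System.out.printf(Locale.ROOT, "dvd size:   %+.1f%%%n",
            percentChange(1.95, 1.15));     // -41.0%
    }
}
```

In other words, the ~41% reduction in *.dvd size comes with roughly 4.4% more write time and 8.9% more merge time in this test.
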

This feature is intended only for high-cardinality fields.
I'm running benchmarks with luceneutil and will attach the report and patch
once they are done.



