[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263831#comment-17263831
 ] 

Jaison.Bi commented on LUCENE-9663:
-----------------------------------

Thanks for the comment, [~sokolov]
{quote}if you are running luceneutil tests, could you please also report QPS 
changes?
{quote}
Sure, I will.
{quote}I'm not clear what the usage of this {{keywords}} field is exactly - is 
it used for aggregations?
{quote}
Yes, the "keyword" field is mostly used for aggregations.
{quote}It would be good to run a faceting test; luceneutil doesn't really have 
any tests of high-cardinality SSDV aggregations; I think day-of-year is the 
closest it gets. Maybe you could add one? It's important to test the impact on 
the query side.
{quote}
OK, I will look into how to extend luceneutil. In the meantime, I can run an 
additional benchmark with *esrally*, which includes some aggregation tests. 
Would that be all right?

Actually, aggregations use *global ordinal data* rather than the terms dict 
directly, so terms dict compression will affect the performance of building 
the global ordinal data. In any case, I will test the impact on the query side.
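
To illustrate why the terms dict still matters for aggregations, here is a minimal sketch of how a segment-ordinal to global-ordinal mapping could be built. It is only an assumption-laden illustration, not Lucene's actual implementation: the class name {{GlobalOrdinalsSketch}} and the method {{buildGlobalOrds}} are hypothetical, terms are plain Java strings compared with {{String.compareTo}} rather than byte sequences, and each segment's terms dict is modelled as a sorted list. The point is that every term of every segment has to be read (and therefore decompressed) once while the mapping is built.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch; not Lucene code. */
public class GlobalOrdinalsSketch {

    /**
     * segmentTerms.get(s) is the sorted, de-duplicated terms dict of segment s;
     * the index of a term in that list is its segment-local ordinal.
     * Returns map[s][localOrd] = global ordinal of that term.
     */
    public static int[][] buildGlobalOrds(List<List<String>> segmentTerms) {
        // Pass 1: read every term of every segment. With a compressed terms
        // dict, this is where the decompression cost would be paid.
        TreeSet<String> allTerms = new TreeSet<>();
        for (List<String> terms : segmentTerms) {
            allTerms.addAll(terms);
        }

        // Assign global ordinals in sorted order of the merged term set.
        Map<String, Integer> globalOrd = new HashMap<>();
        int next = 0;
        for (String term : allTerms) {
            globalOrd.put(term, next++);
        }

        // Pass 2: record, per segment, the global ordinal of each local ordinal.
        int[][] map = new int[segmentTerms.size()][];
        for (int s = 0; s < segmentTerms.size(); s++) {
            List<String> terms = segmentTerms.get(s);
            map[s] = new int[terms.size()];
            for (int ord = 0; ord < terms.size(); ord++) {
                map[s][ord] = globalOrd.get(terms.get(ord));
            }
        }
        return map;
    }

    public static void main(String[] args) {
        int[][] map = buildGlobalOrds(List.of(
                List.of("apple", "cherry"),
                List.of("banana", "cherry")));
        // Segment 0 maps local ords {0, 1} to global ords {0, 2};
        // segment 1 maps local ords {0, 1} to global ords {1, 2}.
        for (int[] segment : map) {
            System.out.println(java.util.Arrays.toString(segment));
        }
    }
}
```

Once such a mapping exists, an aggregation can count by global ordinal without touching the terms dict again until it resolves the top buckets back to terms, which is why I expect the main query-side effect to show up when the global ordinals are (re)built.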

> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
>                 Key: LUCENE-9663
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Jaison.Bi
>            Priority: Trivial
>
> Elasticsearch keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 added prefix-compression for the docvalues terms dict. We can do 
> better by replacing prefix-compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on the real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into 1 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for high-cardinality fields. 
>  I'm doing the benchmark test based on luceneutil. Will attach the report and 
> patch after the test.
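
For readers unfamiliar with the prefix-compression mentioned in the description above, the following is a rough sketch of the general idea over a sorted term list: each term stores only the length of the prefix it shares with the previous term, plus the remaining suffix. This is not the Lucene codec code; the class {{PrefixCodingSketch}} and its {{Entry}} type are hypothetical, and it works on Java strings rather than raw bytes. LZ4 is not shown because the JDK has no built-in LZ4 implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of prefix coding over sorted terms; not Lucene code. */
public class PrefixCodingSketch {

    /** One encoded term: shared-prefix length with the previous term, plus the suffix. */
    public static final class Entry {
        public final int prefixLen;
        public final String suffix;

        public Entry(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    /** Encodes each term relative to the term before it (the first term is stored whole). */
    public static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the original terms by replaying the entries in order. */
    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.prefixLen) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "apply", "banana");
        for (Entry e : encode(terms)) {
            System.out.println(e.prefixLen + " + \"" + e.suffix + "\"");
        }
    }
}
```

Prefix coding only removes redundancy between neighbouring terms at their start, whereas a general-purpose compressor such as LZ4 can also exploit repeated substrings elsewhere in a block, which is one plausible reason for the size difference in the table above.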


