[jira] [Commented] (LUCENE-10449) Performance regression due to LZ4 compression of TermsDict in SortedSetDocValues

Robert Muir (Jira) Wed, 02 Mar 2022 06:31:22 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500198#comment-17500198
 ]


Robert Muir commented on LUCENE-10449:
--------------------------------------

{quote}
Our need in terms of data access pattern is to scan and retrieve a large number 
of keyword (binary) values.
{quote}

This is exactly what BinaryDocValues is designed to do. Try it out; I'm 
going to guess that it performs "many times faster" for your use case than 
what you are doing now.


> Performance regression due to LZ4 compression of TermsDict in 
> SortedSetDocValues
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10449
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10449
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.0
>            Reporter: Renaud Delbru
>            Priority: Major
>         Attachments: lucene-8.11-no-compression.png, lucene-9.png
>
>
> LUCENE-9843 removed the compression option for SortedSetDocValues term 
> dictionaries and enabled LZ4 compression all the time. This has quite an 
> impact on our workloads, which heavily use sorted set doc values. It may lead 
> to a performance regression of 2x up to 5x. See samples below.
>  
> {code:java}
> ❯ times_tasks Elasticsearch 7.10.2 (Lucene 8.7) - no terms dict compression
> name                      type                        time_min time_max time_p50 time_p90
> 7.10.2-22.6-SNAPSHOT.json total                       42       90       45       66
> 7.10.2-22.6-SNAPSHOT.json SearchJoinRequest1          14       32       15       18
> 7.10.2-22.6-SNAPSHOT.json SearchTaskBroadcastRequest2 23       53       27       43
> ❯ times_tasks Elasticsearch 7.17.1 (Lucene 8.11) - with terms dict compression
> name                      type                        time_min time_max time_p50 time_p90
> 7.17.0-27.1-SNAPSHOT.json total                       253      327      285      310
> 7.17.0-27.1-SNAPSHOT.json SearchJoinRequest1          121      154      142      152
> 7.17.0-27.1-SNAPSHOT.json SearchTaskBroadcastRequest2 122      173      140      152
> ❯ times_tasks Elasticsearch 7.17.1 (Lucene 8.11) - lucene_default codec is used to bypass the terms dict compression
> name                        type                        time_min time_max time_p50 time_p90
> 7.17.0-27.1-SNAPSHOT.json.2 total                       48       96       63       75
> 7.17.0-27.1-SNAPSHOT.json.2 SearchJoinRequest1          19       44       25       31
> 7.17.0-27.1-SNAPSHOT.json.2 SearchTaskBroadcastRequest2 23       42       29       37
> ❯ times_tasks Elasticsearch 8.0 (Lucene 9.0) - with terms dict compression
> name                     type                        time_min time_max time_p50 time_p90
> 8.0.0-28.0-SNAPSHOT.json total                       260      327      287      313
> 8.0.0-28.0-SNAPSHOT.json SearchJoinRequest1          122      168      148      158
> 8.0.0-28.0-SNAPSHOT.json SearchTaskBroadcastRequest2 123      165      139      155
> {code}
> We can clearly see in the benchmark the impact of the terms dict compression 
> on our workload. Profiling the execution indicates that the bottleneck is 
> {{LZ4.decompress}}. We have attached two screenshots of a flamegraph.
> The CPU time of the {{TermsDict.next}} method in Lucene 8.11 with no terms 
> dict compression is around 2 seconds, while the CPU time of the same method 
> in Lucene 9.0 is 12 seconds. This was measured on a small benchmark that 
> reads a sorted set doc values field a fixed number of times. Each document is 
> created with a single keyword value that represents a UUID. 
>  
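
The slowdown implied by the quoted figures can be checked with a short calculation. The sketch below is only an illustration: the class name {{SlowdownRatios}} and the helper {{ratio}} are hypothetical, the time_p50 values are copied from the benchmark samples in the issue description (Elasticsearch 7.10.2 without compression versus Elasticsearch 8.0 with compression), and the units are assumed to match across runs since the issue does not state them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SlowdownRatios {
    // Hypothetical helper: how many times slower the regressed run is than the baseline.
    // Both arguments must be in the same unit (the issue does not state the unit).
    static double ratio(double baseline, double regressed) {
        return regressed / baseline;
    }

    public static void main(String[] args) {
        // time_p50 pairs copied from the issue description:
        // {Elasticsearch 7.10.2 without compression, Elasticsearch 8.0 with compression}
        Map<String, double[]> p50 = new LinkedHashMap<>();
        p50.put("total", new double[] {45, 287});
        p50.put("SearchJoinRequest1", new double[] {15, 148});
        p50.put("SearchTaskBroadcastRequest2", new double[] {27, 139});

        for (Map.Entry<String, double[]> e : p50.entrySet()) {
            double[] t = e.getValue();
            System.out.printf("%-28s %.1fx%n", e.getKey(), ratio(t[0], t[1]));
        }

        // CPU time of TermsDict.next reported in the issue:
        // about 2 s (Lucene 8.11, no compression) versus 12 s (Lucene 9.0).
        System.out.printf("%-28s %.1fx%n", "TermsDict.next CPU time", ratio(2, 12));
    }
}
```

The same helper can be pointed at the time_min, time_max, or time_p90 columns to see how the ratio varies across the distribution.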



