[ 
https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116518#comment-17116518
 ] 

Adrien Grand commented on LUCENE-9378:
--------------------------------------

I profiled some of these sorting tasks to understand where time is spent. 
While a non-negligible share of time goes to reading lengths, the bulk of the 
CPU time is spent decompressing bytes, because titles in wikimedium are highly 
compressible and include many exact duplicates. Furthermore, in your case, 
decoding the lengths is likely even cheaper given that all documents have the 
same length.

We have discussed building dictionaries in the past, so that we do not have to 
decompress all values in a block when we need only one of them. This was 
initially proposed for stored fields, but I believe it could help in this case 
too. It's not really low-hanging fruit though. Would it be an ok workaround for 
you if, in the meantime, we just disabled compression for short binary values? 
For example, if the average length of values in a block is small enough, we 
would write the values without compression. This would help avoid introducing a 
flag.
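
A rough sketch of what such a per-block decision could look like is below. It 
is not Lucene's BinaryDocValues writer: the class name, the 32-byte threshold 
and the one-byte block header are all hypothetical, and java.util.zip.Deflater 
only stands in for whatever compression the codec actually uses. Value lengths 
are assumed to be stored separately.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/** Illustrative sketch only; not Lucene's actual doc-values writer. */
class ShortValueBlockSketch {
  // Hypothetical cutoff: blocks whose average value length is at or below
  // this many bytes are written without compression.
  static final int MAX_AVG_LENGTH_UNCOMPRESSED = 32;

  /** Returns true if the block's average value length exceeds the cutoff. */
  static boolean shouldCompress(byte[][] values) {
    if (values.length == 0) {
      return false;
    }
    long total = 0;
    for (byte[] v : values) {
      total += v.length;
    }
    // Compare total against cutoff * count to avoid a division.
    return total > (long) MAX_AVG_LENGTH_UNCOMPRESSED * values.length;
  }

  /**
   * Encodes one block as a header byte (0 = raw, 1 = compressed) followed by
   * the concatenated values, either as-is or deflated.
   */
  static byte[] encodeBlock(byte[][] values) {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    for (byte[] v : values) {
      raw.write(v, 0, v.length);
    }
    byte[] concatenated = raw.toByteArray();

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (!shouldCompress(values)) {
      out.write(0); // header: raw bytes follow
      out.write(concatenated, 0, concatenated.length);
      return out.toByteArray();
    }

    out.write(1); // header: compressed bytes follow
    Deflater deflater = new Deflater();
    deflater.setInput(concatenated);
    deflater.finish();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }
}
```

With this shape, a reader can check the header byte and, for raw blocks, seek 
straight to the requested value instead of decompressing the whole block, and 
no user-visible flag is needed.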

> Configurable compression for BinaryDocValues
> --------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>
> Lucene 8.5.1 includes a change to always [compress 
> BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This 
> caused a ~30% reduction in our red-line QPS (throughput). 
> We think users should be given some way to opt in to this compression 
> feature instead of it always being enabled, since it can have a substantial 
> query-time cost, as we saw during our upgrade. [~mikemccand] suggested one 
> possible approach: introduce a *mode* in Lucene80DocValuesFormat (COMPRESSED 
> and UNCOMPRESSED) and allow users to create a custom Codec that subclasses 
> the default Codec and picks the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes, 
> Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here is a related issue for adding a benchmark covering BINARY doc values 
> query-time performance - [https://github.com/mikemccand/luceneutil/issues/61]


