[ https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116518#comment-17116518 ]
Adrien Grand commented on LUCENE-9378:
--------------------------------------

I profiled some of these sorting tasks to understand where time is spent. While a non-negligible amount of time is spent reading lengths, the bulk of the CPU time is spent decompressing bytes, given how highly compressible titles are in wikimedium, with lots of exact duplicates. Furthermore, in your case, decoding the lengths is likely even cheaper given that all documents have the same length.

We have discussed building dictionaries in the past, in order to not have to decompress all values when we need a single one in a block. This was initially for stored fields, but I believe that it could help in this case too. It's not really low-hanging fruit though.

Would it be an OK workaround for you if, in the meantime, we just disabled compression for short binary values? E.g., if the average length of values in a block is small enough, we'd write the values without compression. This would help avoid introducing a flag.

> Configurable compression for BinaryDocValues
> --------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>
> Lucene 8.5.1 includes a change to always [compress
> BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This
> caused a ~30% reduction in our red-line QPS (throughput).
> We think users should be given some way to opt in to this compression
> feature instead of it always being enabled, which can have a substantial
> query-time cost, as we saw during our upgrade. [~mikemccand] suggested one
> possible approach: introducing a *mode* in Lucene80DocValuesFormat
> (COMPRESSED and UNCOMPRESSED) and allowing users to create a custom Codec
> subclassing the default Codec and pick the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes,
> Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here is a related issue for adding a benchmark covering BINARY doc values
> query-time performance: [https://github.com/mikemccand/luceneutil/issues/61]
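
As a rough illustration of the workaround proposed in the comment above (writing a block uncompressed when its values are short on average), the per-block decision might look something like the sketch below. This is not Lucene code: the class name, the method, and the 32-byte cutoff are all assumptions made for the example, and the issue itself does not specify a threshold.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not part of Lucene: decides per block whether
 * binary values should be compressed, based on their average length.
 */
public final class BlockCompressionHeuristic {

    /** Assumed cutoff in bytes; the issue does not name a value. */
    public static final int MAX_AVG_LENGTH_FOR_RAW = 32;

    private BlockCompressionHeuristic() {}

    /**
     * Returns true if the block should be compressed, i.e. if the
     * average value length exceeds the assumed cutoff.
     */
    public static boolean shouldCompress(int[] valueLengths) {
        if (valueLengths.length == 0) {
            return false; // empty block: nothing to compress
        }
        long total = 0;
        for (int len : valueLengths) {
            total += len;
        }
        // Compare total against cutoff * count to stay in integer math.
        return total > (long) MAX_AVG_LENGTH_FOR_RAW * valueLengths.length;
    }

    public static void main(String[] args) {
        int[] shortValues = new int[32];
        Arrays.fill(shortValues, 8);    // e.g. short fixed-length keys
        int[] longValues = new int[32];
        Arrays.fill(longValues, 500);   // e.g. long titles or payloads

        System.out.println("short block compressed? " + shouldCompress(shortValues));
        System.out.println("long block compressed?  " + shouldCompress(longValues));
    }
}
```

A decision like this would be made once per block at write time, so readers of short-valued blocks never pay the decompression cost, and no user-facing flag is needed.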