[ 
https://issues.apache.org/jira/browse/LUCENE-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302094#comment-17302094
 ] 

Michael McCandless commented on LUCENE-9843:
--------------------------------------------

I agree having options on {{Codec}} implementations adds frustrating code 
complexity!

But the compression vs speed option is an especially tricky one since it is so 
brutally use-case dependent.

Some users want the smallest possible indices and do not care so much about 
query performance.  Others are willing to have larger indices if querying can 
go even a wee bit faster.  Our usage (Amazon's customer-facing product search) 
is in the latter category: when we first upgraded to Lucene 8.5.1, which 
enabled compression for all {{BINARY}} fields with no option to disable it, it 
was a big (~30%) hit to red-line QPS in our internal (single production shard) 
benchmarks.

We proceeded with upgrading, but forked the default {{Codec}} to fall back to 
the pre-8.5 implementation for doc values as a short-term measure, and then 
iterated (LUCENE-9378 and [https://github.com/apache/lucene-solr/pull/1543] and 
[https://github.com/apache/lucene-solr/pull/2069] – thank you [~jpountz]!) to 
add the option for compression.  But we would really rather not live with a 
long-term fork of the default {{Codec}}...

At least two other users/use-cases also saw a negative impact on their apps 
using Lucene due to {{BINARY}} compression: the vectors extension in 
Elasticsearch, and Twitter.

We also learned, surprisingly, that compression was GOOD for {{luceneutil}} 
faceting tasks, perhaps because those tasks compute facets on all documents 
({{MatchAllDocsQuery}}) and so they load {{byte[]}} for every document in the 
index.  That is the best case for compression: the decompression cost is 
"maximally amortized", and fewer bytes are loaded from the index.  We (Amazon 
hat) have since enabled compression for Lucene faceting in our usage as well, 
since it was neutral within noise on search metrics yet reduced the index 
size.
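
To make the amortization point concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation: {{java.util.zip}} stands in for whatever compression the codec uses, and the class name, the block size of 32 documents, and the fixed 16-byte values are all assumptions chosen for illustration. It counts how many block decompressions are needed when every document is read in order versus when only one document per block is read.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy block-compressed store of fixed-length binary values (illustration only). */
final class BlockCompressedValues {
    static final int DOCS_PER_BLOCK = 32; // assumed block size
    static final int VALUE_LENGTH = 16;   // fixed length keeps offsets trivial

    private final List<byte[]> blocks = new ArrayList<>();
    private final List<Integer> rawLengths = new ArrayList<>();
    private int cachedBlock = -1;  // index of the block currently decompressed
    private byte[] cachedBytes;
    int decompressions = 0;        // how many blocks we had to inflate

    BlockCompressedValues(byte[][] values) {
        for (int start = 0; start < values.length; start += DOCS_PER_BLOCK) {
            int end = Math.min(start + DOCS_PER_BLOCK, values.length);
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            for (int doc = start; doc < end; doc++) {
                raw.write(values[doc], 0, VALUE_LENGTH);
            }
            blocks.add(compress(raw.toByteArray()));
            rawLengths.add(raw.size());
        }
    }

    /** Returns the value for docId, inflating its block unless it is already cached. */
    byte[] get(int docId) {
        int block = docId / DOCS_PER_BLOCK;
        if (block != cachedBlock) {
            cachedBytes = decompress(blocks.get(block), rawLengths.get(block));
            cachedBlock = block;
            decompressions++;
        }
        int offset = (docId % DOCS_PER_BLOCK) * VALUE_LENGTH;
        return Arrays.copyOfRange(cachedBytes, offset, offset + VALUE_LENGTH);
    }

    private static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] decompress(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        try {
            int read = 0;
            while (read < rawLength && !inflater.finished()) {
                read += inflater.inflate(raw, read, rawLength - read);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    /** Builds highly redundant 16-byte sample values, one per document. */
    static byte[][] sampleValues(int count) {
        byte[][] values = new byte[count][];
        for (int i = 0; i < count; i++) {
            values[i] = String.format("category-%07d", i % 8)
                .getBytes(StandardCharsets.US_ASCII);
        }
        return values;
    }

    public static void main(String[] args) {
        byte[][] values = sampleValues(128);

        // Facet-style access: visit every document in order.
        BlockCompressedValues all = new BlockCompressedValues(values);
        for (int doc = 0; doc < values.length; doc++) {
            all.get(doc);
        }
        System.out.println("sequential: 128 reads, " + all.decompressions + " decompressions");

        // Selective access: one matching document per block.
        BlockCompressedValues sparse = new BlockCompressedValues(values);
        for (int doc = 0; doc < values.length; doc += DOCS_PER_BLOCK) {
            sparse.get(doc);
        }
        System.out.println("sparse: 4 reads, " + sparse.decompressions + " decompressions");
    }
}
```

In this toy model the match-all pass pays 4 decompressions for 128 values, while the selective pass pays the same 4 decompressions for only 4 values, which is the shape of the trade-off described above: the per-value cost of compression depends heavily on how many documents in each block a query actually touches.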

So we are using compression for some {{BINARY}} doc values fields, but turning 
it off for other fields.  Having the choice is helpful/impactful, for us 
anyways.

I think especially for {{BINARY}} doc values the use-cases can be even more 
diverse, since it is more of a catch-all doc values type, where applications 
can encode interesting things into {{byte[]}}.

If we really must take away the "speed versus compression" choice then I think 
we should also remove the compression, i.e. we should not try to compress 
{{BINARY}} fields?  Or, would it make the code simpler if we just made it 
another {{DocValuesType}} e.g. {{BINARY}} and {{BINARY_COMPRESSED}} or 
something?

NOTE: LUCENE-9211 is where we first added the {{BINARY}} compression.

I agree testing is also harder because of this option.  Maybe we could improve 
the test infra, e.g. the silly tool ({{TestBackwardsCompatibility}} itself I 
think) that generates older indices for testing, to do a better job toggling 
between {{SPEED}} and {{COMPRESSED}} when it generates test indices?

> Remove compression option on doc values
> ---------------------------------------
>
>                 Key: LUCENE-9843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9843
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Options on file formats add complexity and put a big tax on 
> backward-compatibility testing. I'm the one who introduced it in LUCENE-9378, but 
> I would now like to think about what we can do to remove this option.
> For the record, compression was initially introduced because some binary 
> fields have so much redundancy that it's wasteful not to compress them at 
> all. But unfortunately, this slowed down some search workloads and we decided 
> to introduce this option as a way to let users choose the trade-off they want.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

