[ https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136668#comment-17136668 ]
Alex Klibisz commented on LUCENE-9378:
--------------------------------------

[~jpountz] Sure, I'll explain below:

My plugin is doing nearest-neighbors search on sparse and dense vectors. Neighbors are "near" based on a similarity score like L1, L2, Angular, Hamming, or Jaccard similarity. Also, when I say "vector" I mean it in the math/physics sense, not the "Vector" style of data structure. The only relevant data structure for storing a vector is a simple array of floats or ints.

I'm storing the contents of each vector in binary doc values. For dense floating-point vectors, I store the literal numbers (e.g. 0.9,0.22,1.234,...) as a `float[]`. For sparse boolean vectors, I store the indices where the vector is "true" (1,22,99,101,...) as an `int[]`. In both cases the ints and floats are serialized to a byte array using the sun.misc.Unsafe module and passed to Lucene as a `new BinaryDocValuesField()` (a rough sketch of this indexing side is below). From Lucene's perspective the serialization protocol shouldn't matter; I could just as well be using an ObjectOutputStream, DataOutputStream, etc. Typical vector length is ~1000, so 1000 4-byte ints/floats produces a `byte[]` of length 4000. I also experimented with variable-length encoding schemes, but found they weren't saving much space at all, since Lucene was already compressing the byte array.

My benchmark just repeatedly runs queries against a corpus of these stored vectors. So it's a loop like this (see the query-side sketch below):
* get a query vector
* for every doc in the Lucene shard
** read the vector corresponding to the doc from binary doc values (this is the LZ4.decompress() part that got much slower with the upgrade to 8.5.0)
** convert the byte array to an array of floats or ints
** compute the similarity score of the array against the query vector (this is the `sortedIntersectionCount` in the screenshots I posted)
** return the score

The code is in a very experimental state, but if it helps I can try to clean it up and make it reproducible for others.

It seems like a nice solution would be the ability to configure or disable the level of compression when I store a BinaryDocValuesField. Or maybe there is another way to store these vectors that avoids the compression overhead? I'm open to other options. I'm much more familiar with Elasticsearch internals than I am with Lucene internals.
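Here is a rough sketch of the indexing side I described above. It uses ByteBuffer instead of sun.misc.Unsafe just to keep the sketch self-contained, and the "vec" field name and helper names are placeholders rather than the actual plugin code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

public final class VectorFields {

  // Dense float vector: 4 bytes per component, fixed byte order.
  static BytesRef encodeDense(float[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) buf.putFloat(v);
    return new BytesRef(buf.array());
  }

  // Sparse boolean vector: the sorted indices of the "true" entries, 4 bytes each.
  static BytesRef encodeSparse(int[] trueIndices) {
    ByteBuffer buf = ByteBuffer.allocate(trueIndices.length * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (int i : trueIndices) buf.putInt(i);
    return new BytesRef(buf.array());
  }

  // The serialized bytes are stored exactly like any other binary doc value.
  static Document denseVectorDoc(float[] values) {
    Document doc = new Document();
    doc.add(new BinaryDocValuesField("vec", encodeDense(values)));
    return doc;
  }
}
```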
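And a rough sketch of the query-side loop, per leaf, for the sparse case. `sortedIntersectionCount` here is a simplified stand-in (count of values shared by two sorted arrays), "vec" is again a placeholder field name, and the `binaryValue()` call is where the LZ4 decompression shows up in the 8.5.x profiles:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public final class SparseVectorScorer {

  // Inverse of encodeSparse(): read back the sorted "true" indices.
  static int[] decodeSparse(BytesRef bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).order(ByteOrder.LITTLE_ENDIAN);
    int[] indices = new int[bytes.length / Integer.BYTES];
    for (int i = 0; i < indices.length; i++) indices[i] = buf.getInt();
    return indices;
  }

  // Simplified similarity: how many indices two sorted arrays have in common.
  static int sortedIntersectionCount(int[] a, int[] b) {
    int i = 0, j = 0, count = 0;
    while (i < a.length && j < b.length) {
      if (a[i] < b[j]) i++;
      else if (a[i] > b[j]) j++;
      else { count++; i++; j++; }
    }
    return count;
  }

  // Score every doc in one leaf against the query vector's true indices.
  static void scoreLeaf(LeafReader reader, int[] queryIndices) throws IOException {
    BinaryDocValues values = DocValues.getBinary(reader, "vec");
    for (int docId = values.nextDoc(); docId != DocIdSetIterator.NO_MORE_DOCS; docId = values.nextDoc()) {
      BytesRef bytes = values.binaryValue(); // in 8.5.x this read triggers the LZ4 decompression
      int score = sortedIntersectionCount(decodeSparse(bytes), queryIndices);
      // ... collect (docId, score), keep top-k, etc.
    }
  }
}
```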
> Configurable compression for BinaryDocValues
> ---------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>         Attachments: image-2020-06-12-22-17-30-339.png, image-2020-06-12-22-17-53-961.png, image-2020-06-12-22-18-24-527.png, image-2020-06-12-22-18-48-919.png
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Lucene 8.5.1 includes a change to always [compress BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This caused a ~30% reduction in our red-line QPS (throughput).
> We think users should be given some way to opt in to this compression feature instead of it always being enabled, which can have a substantial query-time cost, as we saw during our upgrade. [~mikemccand] suggested one possible approach: introduce a *mode* in Lucene80DocValuesFormat (COMPRESSED and UNCOMPRESSED) and allow users to create a custom Codec, subclassing the default Codec, that picks the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes, Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here's a related issue for adding a benchmark covering BINARY doc values query-time performance - [https://github.com/mikemccand/luceneutil/issues/61]
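For reference, a minimal sketch of what the opt-out suggested in the quoted issue description could look like from the user side, assuming the proposed Mode enum on Lucene80DocValuesFormat existed (it does not today; COMPRESSED/UNCOMPRESSED are only the names floated above). FilterCodec and PerFieldDocValuesFormat are real Lucene APIs; the format constructor taking a mode is the hypothetical part:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;

// Delegates everything to the default codec except the doc values format,
// which is swapped for the proposed uncompressed mode.
public final class UncompressedBinaryDocValuesCodec extends FilterCodec {

  private final DocValuesFormat docValuesFormat = new PerFieldDocValuesFormat() {
    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
      // Hypothetical constructor: the Mode enum is the API proposed in this issue.
      return new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.UNCOMPRESSED);
    }
  };

  public UncompressedBinaryDocValuesCodec() {
    super("UncompressedBinaryDocValuesCodec", Codec.getDefault());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return docValuesFormat;
  }
}
```

A user would then set this codec on their IndexWriterConfig via setCodec(new UncompressedBinaryDocValuesCodec()), plus the usual SPI registration so the codec can be resolved by name when segments are read back.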