[ https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136668#comment-17136668 ]
Alex Klibisz commented on LUCENE-9378:
--------------------------------------

[~jpountz] Sure, I'll explain below:

My plugin is doing nearest-neighbors search on sparse and dense vectors. Neighbors are "near" based on a similarity score like L1, L2, Angular, Hamming, or Jaccard similarity. Also, when I say "vector" I mean it in the math/physics sense, not the "Vector" style of data structure. The only relevant data structure for storing a vector is a simple array of floats or ints.

I'm storing the contents of each vector in binary doc values. For dense floating-point vectors, I store the literal numbers (e.g. 0.9,0.22,1.234,...) as a `float[]`. For sparse boolean vectors, I store the indices where the vector is "true" (1,22,99,101,...) as an `int[]`. In both cases the ints and floats are serialized to a byte array using the sun.misc.Unsafe module and passed to Lucene as a `new BinaryDocValuesField()` (a rough sketch of this indexing side is below). From Lucene's perspective the serialization protocol shouldn't matter; I could just as well be using an ObjectOutputStream, DataOutputStream, etc. Typical vector length is ~1000, so 1000 4-byte ints/floats produces a `byte[]` of length 4000. I also experimented with variable-length encoding schemes, but found they weren't saving much space at all, since Lucene was already compressing the byte array.

My benchmark just repeatedly runs queries against a corpus of these stored vectors. So it's a loop like this (see the query-side sketch below):
* get a query vector
* for every doc in the Lucene shard
** read the vector corresponding to the doc from binary doc values (this is the LZ4.decompress() part that got much slower with the upgrade to 8.5.0)
** convert the byte array to an array of floats or ints
** compute the similarity score of the array against the query vector (this is the `sortedIntersectionCount` in the screenshots I posted)
** return the score

The code is in a very experimental state, but if it helps I can try to clean it up and make it reproducible for others.

It seems like a nice solution would be the ability to configure or disable the level of compression when I store a BinaryDocValuesField. Or maybe there is another way to store these vectors that avoids the compression overhead? I'm open to other options. I'm much more familiar with Elasticsearch internals than I am with Lucene internals.
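Here is a rough sketch of the indexing side I described above. It uses ByteBuffer instead of sun.misc.Unsafe just to keep the sketch self-contained, and the "vec" field name and helper names are placeholders rather than the actual plugin code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

public final class VectorFields {

  // Dense float vector: 4 bytes per component, fixed byte order.
  static BytesRef encodeDense(float[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) buf.putFloat(v);
    return new BytesRef(buf.array());
  }

  // Sparse boolean vector: the sorted indices of the "true" entries, 4 bytes each.
  static BytesRef encodeSparse(int[] trueIndices) {
    ByteBuffer buf = ByteBuffer.allocate(trueIndices.length * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (int i : trueIndices) buf.putInt(i);
    return new BytesRef(buf.array());
  }

  // The serialized bytes are stored exactly like any other binary doc value.
  static Document denseVectorDoc(float[] values) {
    Document doc = new Document();
    doc.add(new BinaryDocValuesField("vec", encodeDense(values)));
    return doc;
  }
}
```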
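And a rough sketch of the query-side loop, per leaf, for the sparse case. `sortedIntersectionCount` here is a simplified stand-in (count of values shared by two sorted arrays), "vec" is again a placeholder field name, and the `binaryValue()` call is where the LZ4 decompression shows up in the 8.5.x profiles:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public final class SparseVectorScorer {

  // Inverse of encodeSparse(): read back the sorted "true" indices.
  static int[] decodeSparse(BytesRef bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).order(ByteOrder.LITTLE_ENDIAN);
    int[] indices = new int[bytes.length / Integer.BYTES];
    for (int i = 0; i < indices.length; i++) indices[i] = buf.getInt();
    return indices;
  }

  // Simplified similarity: how many indices two sorted arrays have in common.
  static int sortedIntersectionCount(int[] a, int[] b) {
    int i = 0, j = 0, count = 0;
    while (i < a.length && j < b.length) {
      if (a[i] < b[j]) i++;
      else if (a[i] > b[j]) j++;
      else { count++; i++; j++; }
    }
    return count;
  }

  // Score every doc in one leaf against the query vector's true indices.
  static void scoreLeaf(LeafReader reader, int[] queryIndices) throws IOException {
    BinaryDocValues values = DocValues.getBinary(reader, "vec");
    for (int docId = values.nextDoc(); docId != DocIdSetIterator.NO_MORE_DOCS; docId = values.nextDoc()) {
      BytesRef bytes = values.binaryValue(); // in 8.5.x this read triggers the LZ4 decompression
      int score = sortedIntersectionCount(decodeSparse(bytes), queryIndices);
      // ... collect (docId, score), keep top-k, etc.
    }
  }
}
```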
> Configurable compression for BinaryDocValues
> ---------------------------------------------
>
>                 Key: LUCENE-9378
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9378
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Viral Gandhi
>            Priority: Minor
>         Attachments: image-2020-06-12-22-17-30-339.png, image-2020-06-12-22-17-53-961.png, image-2020-06-12-22-18-24-527.png, image-2020-06-12-22-18-48-919.png
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Lucene 8.5.1 includes a change to always [compress BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This caused a ~30% reduction in our red-line QPS (throughput).
> We think users should be given some way to opt in to this compression feature instead of it always being enabled, which can have a substantial query-time cost, as we saw during our upgrade. [~mikemccand] suggested one possible approach: introduce a *mode* in Lucene80DocValuesFormat (COMPRESSED and UNCOMPRESSED) and allow users to create a custom Codec, subclassing the default Codec, that picks the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes, Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here's a related issue for adding a benchmark covering BINARY doc values query-time performance - [https://github.com/mikemccand/luceneutil/issues/61]
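For reference, a minimal sketch of what the opt-out suggested in the quoted issue description could look like from the user side, assuming the proposed Mode enum on Lucene80DocValuesFormat existed (it does not today; COMPRESSED/UNCOMPRESSED are only the names floated above). FilterCodec and PerFieldDocValuesFormat are real Lucene APIs; the format constructor taking a mode is the hypothetical part:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;

// Delegates everything to the default codec except the doc values format,
// which is swapped for the proposed uncompressed mode.
public final class UncompressedBinaryDocValuesCodec extends FilterCodec {

  private final DocValuesFormat docValuesFormat = new PerFieldDocValuesFormat() {
    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
      // Hypothetical constructor: the Mode enum is the API proposed in this issue.
      return new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.UNCOMPRESSED);
    }
  };

  public UncompressedBinaryDocValuesCodec() {
    super("UncompressedBinaryDocValuesCodec", Codec.getDefault());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return docValuesFormat;
  }
}
```

A user would then set this codec on their IndexWriterConfig via setCodec(new UncompressedBinaryDocValuesCodec()), plus the usual SPI registration so the codec can be resolved by name when segments are read back.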