[
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159599#comment-17159599
]
Alex Klibisz commented on LUCENE-9322:
--------------------------------------
Hi all. Some great discussion here and in #9004 and #9136.
I've been working on an Elasticsearch plugin for ANN for about 8 months now:
[http://elastiknn.klibisz.com/]. It obviously uses Lucene under the hood, but
I'm definitely more fluent in Elasticsearch concepts than in Lucene internals.
Figured I would mention: one of the early bottlenecks was vector serialization
(using BinaryDocValues to store the vectors). I did extensive benchmarking to
figure out the fastest way to de-/serialize `float[]` and `int[]` arrays
to/from byte arrays. In the end, I found that the `sun.misc.Unsafe` class beat
all the others. Here's the Java utility class that I'm using for
de-/serialization in my plugin:
[https://github.com/alexklibisz/elastiknn/blob/adf8262907093315d772ae524e822a1152b0e929/core/src/main/java/com/klibisz/elastiknn/storage/UnsafeSerialization.java]
Maybe it can be helpful.
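
To make the idea concrete without following the link, here is a minimal
sketch of the general technique. It is not the linked class: the class name
`FloatVectorCodecSketch` and its method names are hypothetical, it assumes the
bytes are written in the platform's native byte order, and the `ByteBuffer`
variant is included only as a portable baseline to compare against.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

// Hypothetical illustration; names are assumptions, not the plugin's API.
final class FloatVectorCodecSketch {

    private static final Unsafe UNSAFE = loadUnsafe();

    // sun.misc.Unsafe has no public constructor; the usual workaround is to
    // read its private static "theUnsafe" field via reflection.
    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    // Bulk-copies the raw contents of the float[] into a new byte[]
    // (4 bytes per float, native byte order).
    static byte[] writeFloatsUnsafe(float[] vector) {
        byte[] bytes = new byte[vector.length * Float.BYTES];
        UNSAFE.copyMemory(vector, Unsafe.ARRAY_FLOAT_BASE_OFFSET,
                          bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET, bytes.length);
        return bytes;
    }

    // Inverse of writeFloatsUnsafe: bulk-copies the bytes back into a float[].
    static float[] readFloatsUnsafe(byte[] bytes) {
        float[] vector = new float[bytes.length / Float.BYTES];
        UNSAFE.copyMemory(bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET,
                          vector, Unsafe.ARRAY_FLOAT_BASE_OFFSET,
                          (long) vector.length * Float.BYTES);
        return vector;
    }

    // Portable baseline using java.nio, also in native byte order so the
    // output is byte-for-byte comparable with writeFloatsUnsafe.
    static byte[] writeFloatsByteBuffer(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                                      .order(ByteOrder.nativeOrder());
        buffer.asFloatBuffer().put(vector);
        return buffer.array();
    }
}
```

A byte array produced this way is the kind of payload one could wrap and store
in a BinaryDocValues field. Note that native byte order makes the encoding
platform-dependent, and that `sun.misc.Unsafe` is an unsupported internal API,
so any real use would need its own benchmarking and a fallback.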
> Discussing a unified vectors format API
> ---------------------------------------
>
> Key: LUCENE-9322
> URL: https://issues.apache.org/jira/browse/LUCENE-9322
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Julie Tibshirani
> Priority: Major
>
> Two different approximate nearest neighbor approaches are currently being
> developed, one based on HNSW (LUCENE-9004) and another based on coarse
> quantization (LUCENE-9136). Each prototype proposes to add a new format to
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API
> that could support both approaches. The two ANN strategies give different
> trade-offs in terms of speed, memory, and complexity, and it’s likely that
> we’ll want to support both. Vector search is also an active research area,
> and it would be great to be able to prototype and incorporate new approaches
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The
> prototype for coarse quantization
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit
> soon (this depends on everyone's feedback of course). The approach is simple
> and shows solid search performance, as seen
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
> I think this API discussion is an important step in moving that
> implementation forward.
> The goals of the API would be:
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector,
> return the indexed vectors that are closest to it.