Julie Tibshirani created LUCENE-9322:
----------------------------------------

             Summary: Discussing a unified vectors format API
                 Key: LUCENE-9322
                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Julie Tibshirani


Two different approximate nearest neighbor approaches are currently being 
developed, one based on HNSW ([#LUCENE-9004]) and another based on coarse 
quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to 
handle vectors. In LUCENE-9136 we discussed the possibility of a unified API 
that could support both approaches. The two ANN strategies give different 
trade-offs in terms of speed, memory, and complexity, and it’s likely that 
we’ll want to support both. Vector search is also an active research area, and 
it would be great to be able to prototype and incorporate new approaches 
without introducing more formats.

To me it seems like a good time to begin discussing a unified API. The 
prototype for coarse quantization 
([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit 
soon (this depends on everyone's feedback of course). The approach is simple 
and shows solid search performance, as seen 
[here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326]. 
I think this API discussion is an important step in moving that implementation 
forward.

The goals of the API would be:
# Support for storing and retrieving individual float vectors.
# Support for approximate nearest neighbor search -- given a query vector, 
return the indexed vectors that are closest to it.

I pushed a proposal for the API here: 
https://github.com/jtibshirani/lucene-solr/pull/2

It adds a new format `VectorsFormat`, which can read and write vectors. This 
format and all associated classes would be experimental. Each approach would 
have its own format implementation, for example `HNSWVectorsFormat` or 
`CoarseQuantizationFormat`.

Given a field, the vectors reader returns a `VectorValues` object. This object supports the following:
* Retrieving a vector value for each document. This capability is currently exposed through a `DocIdSetIterator`, so forward iteration is encouraged (see the iteration sketch after the code block below). The simple coarse quantization approach could choose to store the vectors as binary doc values, while HNSW could use a new storage format with more explicit support for random access (which it needs to support ANN efficiently). Perhaps in the future we’ll settle on a single way of storing the vectors that all implementations can use.
* Finding `k` approximate nearest neighbors to a query vector through the 
`findNearestVectors` method.

{code:java}
/**
 * For the given query vector, finds an approximate set of nearest neighbors.
 *
 * @param queryVector the query vector.
 * @param k the number of nearest neighbors to return.
 * @param recallFactor a parameter which controls the recall of the search.
 *                     Higher values correspond to better recall at the expense
 *                     of more distance computations. The exact meaning of this
 *                     parameter depends on the underlying nearest neighbor
 *                     implementation.
 */
TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor) throws IOException;
{code}
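
For illustration, here is a minimal consumer-side sketch of the read API described above. The reader lookup method and the `vectorValue()` accessor are assumed names for this sketch, not part of the proposal:

{code:java}
// Hypothetical consumer-side sketch. The getVectorValues lookup and the
// vectorValue() accessor are assumed names, not part of the proposal.
VectorValues values = vectorsReader.getVectorValues("embedding");

// Forward iteration via the DocIdSetIterator contract: visit each
// document that has a vector, in increasing docID order.
for (int doc = values.nextDoc();
     doc != DocIdSetIterator.NO_MORE_DOCS;
     doc = values.nextDoc()) {
  float[] vector = values.vectorValue(); // vector for the current doc (assumed accessor)
  // ... consume the vector ...
}

// ANN search: top 10 neighbors, with a recall factor whose meaning
// depends on the underlying implementation.
TopDocs neighbors = values.findNearestVectors(queryVector, 10, 50);
{code}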

Each format can use its dedicated data structures to perform ANN, without 
exposing the details externally. Adding this method to `VectorValues` lets us 
avoid having a different query type per ANN strategy, like `HNSWQuery`. Note 
that it’s a bit tricky to define a unified method, because each algorithm has 
different search parameters -- HNSW has `ef` to control the size of the 
candidate set, whereas coarse quantization has `numCentroids` to specify the 
number of nearest clusters that should be considered. This proposal takes a 
simple strategy: the implementation is allowed one tuning parameter to control 
recall, and the meaning of the parameter depends on the implementation.
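
As a purely illustrative sketch of that strategy, two hypothetical implementations might interpret the parameter like this (class and helper names are made up):

{code:java}
// Purely illustrative: how two implementations might interpret the
// single recallFactor parameter. Class and helper names are hypothetical.
class HnswVectorValues extends VectorValues {
  @Override
  public TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor)
      throws IOException {
    // For HNSW, recallFactor plays the role of ef: the size of the
    // candidate set explored during graph search (must be >= k).
    int ef = Math.max(k, recallFactor);
    return graph.search(queryVector, k, ef); // hypothetical helper
  }
}

class CoarseQuantizationVectorValues extends VectorValues {
  @Override
  public TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor)
      throws IOException {
    // For coarse quantization, recallFactor is the number of nearest
    // centroids whose posting lists are scanned.
    return scanNearestClusters(queryVector, k, recallFactor); // hypothetical helper
  }
}
{code}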

On the write side, we would need to add the ability to buffer + write vectors 
in `DefaultIndexingChain`. This logic would be shared, but the flush and merge 
calls would be delegated to the format. So each implementation could build + 
write the specialized data structures it needs, and define its own way of 
performing merges.
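
A rough sketch of how that delegation could look, following the shape of other codec formats like `DocValuesFormat`; the class and method names here are assumptions, not the actual proposal:

{code:java}
// Hypothetical shape of the format SPI, mirroring how other codec
// formats (e.g. DocValuesFormat) split reading and writing.
public abstract class VectorsFormat {
  /** Writer used at flush/merge time; one per segment. */
  public abstract VectorsWriter fieldsWriter(SegmentWriteState state) throws IOException;

  /** Reader used at search time. */
  public abstract VectorsReader fieldsReader(SegmentReadState state) throws IOException;
}

// The shared indexing chain buffers vectors, then hands them to the
// format, which builds and writes its specialized data structures.
public abstract class VectorsWriter implements Closeable {
  /** Writes all buffered vectors for one field of the flushing segment. */
  public abstract void writeField(FieldInfo field, VectorValues values) throws IOException;

  /** Merges vectors from the given segments; a simple format might
      rewrite them naively, while HNSW might rebuild or join its graphs. */
  public abstract void merge(MergeState mergeState) throws IOException;
}
{code}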

Questions that may come up:
* _Why do we have different implementations of `VectorsFormat`, couldn’t we just add an enum to the field info like `Strategy.HNSW` and `Strategy.COARSE_QUANTIZATION`?_ It seemed cleaner to keep each strategy’s writing/reading logic separate. The current design also makes it possible to plug in a custom implementation of `VectorsFormat`.
* _What about different distance metrics like angular and L1 distance?_ This is an important aspect to consider. I think that even if we just supported Euclidean distance at first, it would be really useful. Along with angular distance, it’s the distance metric I’ve seen used most frequently in applications. And Euclidean distance gives the same nearest-neighbor ordering as angular distance when the vectors are normalized to unit length (see the sketch after this list).
* _How exactly is this used in a search? Where are the `Query` classes?_ This would be the next part of the API to design/discuss. I think the current format proposal could support a few different options.
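
To make the unit-vector equivalence concrete: for unit vectors `a` and `b`, `|a - b|^2 = 2 - 2 * cos(a, b)`, so squared Euclidean distance is a decreasing function of cosine similarity and both metrics rank neighbors identically. A small illustrative check:

{code:java}
// For unit vectors a and b: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 cos(a, b).
// Squared Euclidean distance decreases as cosine similarity increases, so
// both metrics produce the same neighbor ordering on normalized vectors.
static float squaredEuclidean(float[] a, float[] b) {
  float sum = 0;
  for (int i = 0; i < a.length; i++) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

static float dot(float[] a, float[] b) {
  float sum = 0;
  for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// On unit-length inputs, squaredEuclidean(a, b) == 2 - 2 * dot(a, b)
// (up to floating-point error).
{code}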

Thanks, and looking forward to your thoughts!


