Julie Tibshirani created LUCENE-9322:
----------------------------------------
Summary: Discussing a unified vectors format API
Key: LUCENE-9322
URL: https://issues.apache.org/jira/browse/LUCENE-9322
Project: Lucene - Core
Issue Type: New Feature
Reporter: Julie Tibshirani
Two different approximate nearest neighbor approaches are currently being
developed, one based on HNSW ([#LUCENE-9004]) and another based on coarse
quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to
handle vectors. In LUCENE-9136 we discussed the possibility of a unified API
that could support both approaches. The two ANN strategies give different
trade-offs in terms of speed, memory, and complexity, and it’s likely that
we’ll want to support both. Vector search is also an active research area, and
it would be great to be able to prototype and incorporate new approaches
without introducing more formats.
To me it seems like a good time to begin discussing a unified API. The
prototype for coarse quantization
([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit
soon (this depends on everyone's feedback of course). The approach is simple
and shows solid search performance, as seen
[here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
I think this API discussion is an important step in moving that implementation
forward.
The goals of the API would be:
# Support for storing and retrieving individual float vectors.
# Support for approximate nearest neighbor search -- given a query vector,
return the indexed vectors that are closest to it.
I pushed a proposal for the API here:
https://github.com/jtibshirani/lucene-solr/pull/2
It adds a new format `VectorsFormat`, which can read and write vectors. This
format and all associated classes would be experimental. Each approach would
have its own format implementation, for example `HNSWVectorsFormat` or
`CoarseQuantizationFormat`.
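As a rough illustration of that shape (the two subclass names come from the proposal above, but the base class's contents here are hypothetical), each strategy would subclass a common format:

{code:java}
// Illustrative sketch only: the subclass names mirror the proposal, but the
// getName method is a hypothetical stand-in for the real format contract.
abstract class VectorsFormat {
    /** Name recorded in the index so the right format is found at read time. */
    abstract String getName();
}

class HNSWVectorsFormat extends VectorsFormat {
    @Override
    String getName() { return "HNSW"; }
}

class CoarseQuantizationFormat extends VectorsFormat {
    @Override
    String getName() { return "CoarseQuantization"; }
}
{code}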
Given a field, the vectors reader returns a `VectorValues` object. This object
supports the following:
* Retrieving a vector value for each document. This capability is currently
exposed through a `DocIdSetIterator`, so forward iteration is encouraged. The
simple coarse quantization approach could choose to store the vectors as binary
doc values, while HNSW could use a new storage format with more explicit
support for random access (which it needs to efficiently support ANN). Perhaps
in the future we’ll settle on a single way of storing the vectors that all
implementations can use.
* Finding `k` approximate nearest neighbors to a query vector through the
`findNearestVectors` method.
{code:java}
/**
 * For the given query vector, finds an approximate set of nearest neighbors.
 *
 * @param queryVector the query vector.
 * @param k the number of nearest neighbors to return.
 * @param recallFactor a parameter which controls the recall of the search.
 *                     Higher values correspond to better recall at the expense
 *                     of more distance computations. The exact meaning of this
 *                     parameter depends on the underlying nearest neighbor
 *                     implementation.
 */
TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor) throws IOException;
{code}
Each format can use its dedicated data structures to perform ANN, without
exposing the details externally. Adding this method to `VectorValues` lets us
avoid having a different query type per ANN strategy, like `HNSWQuery`. Note
that it’s a bit tricky to define a unified method, because each algorithm has
different search parameters -- HNSW has `ef` to control the size of the
candidate set, whereas coarse quantization has `numCentroids` to specify the
number of nearest clusters that should be considered. This proposal takes a
simple strategy: the implementation is allowed one tuning parameter to control
recall, and the meaning of the parameter depends on the implementation.
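To make the contract concrete, here is a minimal in-memory, exact (brute-force) stand-in for `findNearestVectors`. It is purely illustrative: it returns plain doc ids instead of Lucene's `TopDocs`, and it ignores `recallFactor` because an exhaustive scan is already exact.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Brute-force, in-memory stand-in for the proposed search method. */
class InMemoryVectorValues {
    private final float[][] vectors; // vectors[doc] is the vector for doc id `doc`

    InMemoryVectorValues(float[][] vectors) {
        this.vectors = vectors;
    }

    /** Exact k-nearest-neighbor scan; recallFactor is unused here. */
    int[] findNearestVectors(float[] queryVector, int k, int recallFactor) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < vectors.length; doc++) {
            docs.add(doc);
        }
        docs.sort(Comparator.comparingDouble(doc -> squaredDistance(queryVector, vectors[doc])));
        int[] result = new int[Math.min(k, docs.size())];
        for (int i = 0; i < result.length; i++) {
            result[i] = docs.get(i);
        }
        return result;
    }

    private static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
{code}

A real format would replace the linear scan with its own structures (the HNSW graph, or the centroid lists for coarse quantization) while keeping this same entry point.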
On the write side, we would need to add the ability to buffer + write vectors
in `DefaultIndexingChain`. This logic would be shared, but the flush and merge
calls would be delegated to the format. So each implementation could build +
write the specialized data structures it needs, and define its own way of
performing merges.
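A rough sketch of that split, with all names hypothetical: the shared indexing-chain code buffers vectors per document, and only flush is delegated to the per-format writer (merge would follow the same pattern).

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-format hook; a real implementation would write index files. */
interface VectorsWriter {
    void flush(List<float[]> bufferedVectors);
}

/** Shared buffering logic; only flush is delegated to the format. */
class BufferingVectorsConsumer {
    private final List<float[]> buffer = new ArrayList<>();
    private final VectorsWriter formatWriter;

    BufferingVectorsConsumer(VectorsWriter formatWriter) {
        this.formatWriter = formatWriter;
    }

    void addVector(float[] vector) {
        buffer.add(vector.clone());
    }

    void flush() {
        formatWriter.flush(buffer);
        buffer.clear();
    }
}
{code}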
Questions that may come up:
* _Why do we have different implementations of `VectorsFormat`, couldn’t we
just add an enum to the field info like `Strategy.HNSW` and
`Strategy.COARSE_QUANTIZATION`?_ It seemed cleaner to keep each strategy’s
writing and reading logic separate. The current design also makes it possible to
plug in a custom implementation of `VectorsFormat`.
* _What about different distance metrics like angular and L1 distance?_ This is
an important aspect to consider. I think that even if we just supported
Euclidean distance at first, it would be really useful. Along with angular
distance, it’s the distance metric I’ve seen used most frequently in
applications. And Euclidean distance produces the same nearest-neighbor
ordering as angular distance when the vectors are normalized to unit length.
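That last point can be checked directly: for unit-length vectors, squared Euclidean distance equals 2 − 2·cos θ, a monotone function of the angle, so both metrics rank neighbors identically. A small self-contained check (not Lucene code):

{code:java}
// With unit-length vectors, squaredEuclidean(a, b) == 2 - 2 * cosine(a, b),
// so Euclidean and angular distance rank nearest neighbors the same way.
class DistanceCheck {
    static float[] normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) unit[i] = (float) (v[i] / norm);
        return unit;
    }

    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
        return dot; // inputs are unit length, so the dot product is cos(theta)
    }
}
{code}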
* _How exactly is this used in a search? Where are the `Query` classes?_ This
would be the next part of the API to design and discuss. I think the current
format proposal could support a few different options.
Thanks, and looking forward to your thoughts!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)