Julie Tibshirani created LUCENE-9322:
----------------------------------------
             Summary: Discussing a unified vectors format API
                 Key: LUCENE-9322
                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Julie Tibshirani


Two different approximate nearest neighbor approaches are currently being developed, one based on HNSW ([#LUCENE-9004]) and another based on coarse quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to handle vectors. In LUCENE-9136 we discussed the possibility of a unified API that could support both approaches.

The two ANN strategies give different trade-offs in terms of speed, memory, and complexity, and it's likely that we'll want to support both. Vector search is also an active research area, and it would be great to be able to prototype and incorporate new approaches without introducing more formats. To me it seems like a good time to begin discussing a unified API.

The prototype for coarse quantization ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit soon (this depends on everyone's feedback of course). The approach is simple and shows solid search performance, as seen [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326]. I think this API discussion is an important step in moving that implementation forward.

The goals of the API would be:
# Support for storing and retrieving individual float vectors.
# Support for approximate nearest neighbor search -- given a query vector, return the indexed vectors that are closest to it.

I pushed a proposal for the API here: https://github.com/jtibshirani/lucene-solr/pull/2

It adds a new format `VectorsFormat`, which can read and write vectors. This format and all associated classes would be experimental. Each approach would have its own format implementation, for example `HNSWVectorsFormat` or `CoarseQuantizationFormat`. Given a field, the vectors reader returns a `VectorValues` object.
This object supports the following:
* Retrieving a vector value for each document. This capability is currently exposed through a `DocIdSetIterator`, so forward iteration is encouraged. The simple coarse quantization approach could choose to store the vectors as binary doc values, while HNSW could use a new storage format with more explicit support for random access (which it needs to efficiently support ANN). Perhaps in the future we'll settle on a single way of storing the vectors that all implementations can use.
* Finding the `k` approximate nearest neighbors to a query vector through the `findNearestVectors` method.

{code:java}
/**
 * For the given query vector, finds an approximate set of nearest neighbors.
 *
 * @param queryVector the query vector.
 * @param k the number of nearest neighbors to return.
 * @param recallFactor a parameter which controls the recall of the search. Higher values correspond to better
 *                     recall at the expense of more distance computations. The exact meaning of this parameter
 *                     depends on the underlying nearest neighbor implementation.
 */
TopDocs findNearestVectors(float[] queryVector, int k, int recallFactor) throws IOException;
{code}

Each format can use its dedicated data structures to perform ANN, without exposing the details externally. Adding this method to `VectorValues` lets us avoid having a different query type per ANN strategy, like `HNSWQuery`. Note that it's a bit tricky to define a unified method, because each algorithm has different search parameters -- HNSW has `ef` to control the size of the candidate set, whereas coarse quantization has `numCentroids` to specify the number of nearest clusters that should be considered. This proposal takes a simple strategy: the implementation is allowed one tuning parameter to control recall, and the meaning of that parameter depends on the implementation.

On the write side, we would need to add the ability to buffer and write vectors in `DefaultIndexingChain`.
This logic would be shared, but the flush and merge calls would be delegated to the format. So each implementation could build and write the specialized data structures it needs, and define its own way of performing merges.

Questions that may come up:
* _Why do we have different implementations of `VectorsFormat`? Couldn't we just add an enum to the field info, like `Strategy.HNSW` and `Strategy.COARSE_QUANTIZATION`?_ It seemed cleaner to keep each strategy's writing/reading logic separate. The current design also makes it possible to plug in a custom implementation of `VectorsFormat`.
* _What about different distance metrics like angular and L1 distance?_ This is an important aspect to consider. I think that even if we just supported euclidean distance at first, it would be really useful. Along with angular distance, it's the distance metric I've seen used most frequently in applications. And euclidean distance is equivalent to angular distance when the vectors are normalized to unit length -- it produces the same nearest-neighbor ordering.
* _How exactly is this used in a search? Where are the `Query` classes?_ This would be the next part of the API to design and discuss. I think the current format proposal could support a few different options.

Thanks, and looking forward to your thoughts!
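As an aside on the distance-metric question above: the unit-vector claim follows from the identity ||a - b||^2 = 2 * (1 - cos(a, b)) when ||a|| = ||b|| = 1, so squared euclidean distance is a monotonically decreasing function of cosine similarity and ranks neighbors identically. A quick numeric check (all names here are illustrative, not part of the proposal):

```java
// Verifies ||a - b||^2 == 2 * (1 - cos(a, b)) for unit-normalized vectors,
// which is why euclidean and angular distance agree on neighbor ordering.
class UnitDistanceCheck {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Scales a vector to unit length. */
    static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] u = new double[v.length];
        for (int i = 0; i < v.length; i++) u[i] = v[i] / norm;
        return u;
    }

    static double squaredEuclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** True when the identity holds (within floating-point error) after
     *  normalizing both inputs to unit length. */
    static boolean identityHolds(double[] x, double[] y) {
        double[] a = normalize(x), b = normalize(y);
        return Math.abs(squaredEuclidean(a, b) - 2 * (1 - dot(a, b))) < 1e-12;
    }
}
```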