Julie Tibshirani created LUCENE-9905:
----------------------------------------
Summary: Revise approach to specifying NN algorithm
Key: LUCENE-9905
URL: https://issues.apache.org/jira/browse/LUCENE-9905
Project: Lucene - Core
Issue Type: Improvement
Reporter: Julie Tibshirani
In LUCENE-9322 we decided that the new vectors API shouldn’t assume a
particular nearest-neighbor search data structure and algorithm. This
flexibility is important since NN search is a developing area and we'd like to
be able to experiment and evolve the algorithm. Right now we only have one
algorithm (HNSW), but we want to maintain the ability to use another.
Currently, the algorithm is specified through {{SearchStrategy}}, for
example {{SearchStrategy.EUCLIDEAN_HNSW}}, so a single format implementation is
expected to handle multiple algorithms. Instead, we could have one format
implementation per algorithm. Our current implementation would be HNSW-specific,
with a name like {{HnswVectorFormat}}, and to experiment with another algorithm
you could create a new implementation like {{ClusterVectorFormat}}. This would
be better aligned with the codec framework and would help avoid exposing
algorithm details in the API.
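To make the "one format implementation per algorithm" idea concrete, here is a minimal standalone sketch. It is not Lucene code: the {{VectorFormat}} interface and its {{describe()}} method are hypothetical, and only the two implementation names are taken from the text above.

```java
// Hypothetical sketch: these types only borrow names from the issue text and
// are not Lucene's actual classes or signatures.

final class FormatPerAlgorithmSketch {
  public static void main(String[] args) {
    // Callers hold the common type; the algorithm is an implementation detail.
    VectorFormat[] formats = {new HnswVectorFormat(), new ClusterVectorFormat()};
    for (VectorFormat format : formats) {
      System.out.println(format.getClass().getSimpleName() + " -> " + format.describe());
    }
  }
}

/** Common contract (assumed); it does not mention any NN algorithm. */
interface VectorFormat {
  /** Describes the index structure this format builds. */
  String describe();
}

/** HNSW-specific format: graph details stay inside this class. */
final class HnswVectorFormat implements VectorFormat {
  @Override
  public String describe() {
    return "HNSW graph";
  }
}

/** A second, experimental format added without touching the first. */
final class ClusterVectorFormat implements VectorFormat {
  @Override
  public String describe() {
    return "cluster-based index";
  }
}
```

The point of the sketch is only the shape: adding {{ClusterVectorFormat}} requires no change to {{HnswVectorFormat}} or to any shared enum.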
A concrete proposal (note many of these names will change when LUCENE-9855 is
addressed):
1. Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add
HNSW to the names of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
2. Remove references to HNSW in {{SearchStrategy}}, so there is just
{{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like
{{SimilarityFunction}}.
3. Remove {{FieldType}} attributes related to HNSW parameters (maxConn and
beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
4. Introduce {{PerFieldVectorFormat}} to allow a different NN approach or
parameters to be configured per-field (?)
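Steps 2 to 4 could fit together roughly as in the sketch below. This is a standalone illustration under several assumptions, not Lucene's API: the {{compare}} method, the {{DOT_PRODUCT}} constant, passing {{maxConn}} and {{beamWidth}} as constructor arguments, the {{register}}/{{formatFor}} methods, the field name and the parameter values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of steps 2-4; names follow the issue where possible,
// but nothing here is Lucene's actual API.

final class ProposalSketch {
  public static void main(String[] args) {
    // Step 4: one field overrides the default format and its parameters.
    PerFieldVectorFormat formats = new PerFieldVectorFormat(new HnswVectorFormat(16, 16));
    formats.register("title_vector", new HnswVectorFormat(32, 64));

    HnswVectorFormat chosen = (HnswVectorFormat) formats.formatFor("title_vector");
    System.out.println("maxConn=" + chosen.maxConn + ", beamWidth=" + chosen.beamWidth);

    float d = SimilarityFunction.EUCLIDEAN.compare(new float[] {0f, 0f}, new float[] {3f, 4f});
    System.out.println("squared distance = " + d); // prints "squared distance = 25.0"
  }
}

/** Step 2: similarity only, with no reference to HNSW. */
enum SimilarityFunction {
  EUCLIDEAN {
    @Override
    float compare(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float diff = a[i] - b[i];
        sum += diff * diff;
      }
      return sum; // squared Euclidean distance; lower means closer
    }
  },
  DOT_PRODUCT { // illustrative second constant, assumed for this sketch
    @Override
    float compare(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
      }
      return sum; // higher means more similar
    }
  };

  abstract float compare(float[] a, float[] b);
}

/** Marker for any vector format (assumed). */
interface VectorFormat {}

/** Step 3: HNSW parameters belong to the format, not to the field type. */
final class HnswVectorFormat implements VectorFormat {
  final int maxConn;
  final int beamWidth;

  HnswVectorFormat(int maxConn, int beamWidth) {
    if (maxConn <= 0 || beamWidth <= 0) {
      throw new IllegalArgumentException("maxConn and beamWidth must be positive");
    }
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }
}

/** Step 4: looks up a format per field, falling back to a default. */
final class PerFieldVectorFormat {
  private final Map<String, VectorFormat> byField = new HashMap<>();
  private final VectorFormat defaultFormat;

  PerFieldVectorFormat(VectorFormat defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  void register(String field, VectorFormat format) {
    byField.put(field, format);
  }

  VectorFormat formatFor(String field) {
    return byField.getOrDefault(field, defaultFormat);
  }
}
```

With this split, the similarity function is chosen independently of the algorithm, and algorithm-specific tuning never appears on the field type.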
One note: the current HNSW-based format includes logic for storing a numeric
vector per document, as well as for constructing and storing an HNSW graph. When
adding another implementation, it’d be nice to be able to reuse the logic for
reading/writing numeric vectors. I don’t think we need to design for this
right now, but we can keep it in mind for the future?
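Purely as an illustration of what such reuse might look like, the sketch below factors per-document vector storage into a helper that any format could delegate to. It is not a design proposal and not Lucene code: the {{FlatVectors}} class, its methods and the byte layout (fixed-dimension little-endian floats, addressed by vector ordinal) are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: a shared helper for reading/writing numeric vectors,
// independent of any graph or cluster structure built on top of them.

final class FlatVectorsSketch {
  public static void main(String[] args) {
    float[][] vectors = {{1f, 2f}, {3f, 4f}};
    ByteBuffer encoded = FlatVectors.write(vectors, 2);
    float[] second = FlatVectors.read(encoded, 2, 1);
    System.out.println(second[0] + ", " + second[1]);
  }
}

final class FlatVectors {
  private FlatVectors() {}

  /** Encodes fixed-dimension vectors back to back, one per document ordinal. */
  static ByteBuffer write(float[][] vectors, int dimension) {
    ByteBuffer out =
        ByteBuffer.allocate(vectors.length * dimension * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
    for (float[] vector : vectors) {
      if (vector.length != dimension) {
        throw new IllegalArgumentException("expected dimension " + dimension);
      }
      for (float value : vector) {
        out.putFloat(value);
      }
    }
    out.flip(); // make the written bytes readable from the start
    return out;
  }

  /** Decodes the vector at the given ordinal using absolute reads. */
  static float[] read(ByteBuffer in, int dimension, int ord) {
    float[] vector = new float[dimension];
    int base = ord * dimension * Float.BYTES;
    for (int i = 0; i < dimension; i++) {
      vector[i] = in.getFloat(base + i * Float.BYTES);
    }
    return vector;
  }
}
```

An HNSW-based format and a cluster-based format could both call such a helper for the raw vectors and keep only their own index structure separate.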
This issue is based on a thread [~jpountz] started:
[https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]