[jira] [Created] (LUCENE-9905) Revise approach to specifying NN algorithm

Julie Tibshirani (Jira) Mon, 05 Apr 2021 17:45:04 -0700

Julie Tibshirani created LUCENE-9905:
----------------------------------------


             Summary: Revise approach to specifying NN algorithm
                 Key: LUCENE-9905
                 URL: https://issues.apache.org/jira/browse/LUCENE-9905
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Julie Tibshirani


In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
particular nearest-neighbor search data structure and algorithm. This 
flexibility is important since NN search is a developing area and we'd like to 
be able to experiment and evolve the algorithm. Right now we only have one 
algorithm (HNSW), but we want to maintain the ability to use another.

Currently the algorithm to use is specified through {{SearchStrategy}}, for 
example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is 
expected to handle multiple algorithms. Instead we could have one format 
implementation per algorithm. Our current implementation would be HNSW-specific 
like {{HnswVectorFormat}}, and to experiment with another algorithm you could 
create a new implementation like {{ClusterVectorFormat}}. This would be better 
aligned with the codec framework, and help avoid exposing algorithm details in 
the API.
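
As a rough illustration of "one format implementation per algorithm", here is 
a minimal, self-contained sketch. None of these types are Lucene's actual 
classes: {{VectorFormatSketch}}, {{HnswFormatSketch}}, {{ClusterFormatSketch}} 
and {{FormatRegistrySketch}} are hypothetical names, and the cluster-based 
format is only a placeholder for "some other algorithm".

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for a codec-level vector format.
// This is NOT Lucene's real API; it only illustrates the shape of the idea.
interface VectorFormatSketch {
    String name();          // name under which the format is looked up
    String describeIndex(); // the NN data structure this format builds
}

// The current implementation would become explicitly HNSW-specific.
final class HnswFormatSketch implements VectorFormatSketch {
    public String name() { return "HnswVectorFormat"; }
    public String describeIndex() { return "HNSW graph"; }
}

// Experimenting with another algorithm means adding another format,
// without touching the HNSW one or the search-facing API.
final class ClusterFormatSketch implements VectorFormatSketch {
    public String name() { return "ClusterVectorFormat"; }
    public String describeIndex() { return "cluster-based index"; }
}

// Minimal name-based lookup, loosely in the spirit of how codec
// components are resolved by name (an assumption for this sketch).
final class FormatRegistrySketch {
    private final Map<String, VectorFormatSketch> byName = new TreeMap<>();

    void register(VectorFormatSketch format) {
        byName.put(format.name(), format);
    }

    VectorFormatSketch forName(String name) {
        VectorFormatSketch format = byName.get(name);
        if (format == null) {
            throw new IllegalArgumentException("Unknown vector format: " + name);
        }
        return format;
    }
}
```

With both formats registered, {{forName("ClusterVectorFormat")}} returns the 
cluster-based format, and the algorithm choice never appears in the API used 
to index or search vectors.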

A concrete proposal (note many of these names will change when LUCENE-9855 is 
addressed):
 1. Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
HNSW to the names of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
 2. Remove references to HNSW in {{SearchStrategy}}, so there is just 
{{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like 
{{SimilarityFunction}}.
 3. Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
 4. Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
parameters to be configured per-field (?)
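
Steps 2 to 4 could look roughly like the sketch below. It is only an 
illustration under stated assumptions: the class names carry a {{Sketch}} or 
{{Stub}} suffix because they are hypothetical, {{DOT_PRODUCT}} is assumed as a 
second similarity, the method {{getFormatForField}} is an invented name, and 
the parameter values used are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Step 2: the enum names only the similarity, with no reference to HNSW.
enum SimilarityFunction {
    EUCLIDEAN {
        @Override float compare(float[] a, float[] b) {
            checkSameLength(a, b);
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                float d = a[i] - b[i];
                sum += d * d;
            }
            return sum; // squared Euclidean distance; lower means closer
        }
    },
    DOT_PRODUCT {
        @Override float compare(float[] a, float[] b) {
            checkSameLength(a, b);
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum; // higher means more similar
        }
    };

    abstract float compare(float[] a, float[] b);

    static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ");
        }
    }
}

// Hypothetical minimal format abstraction for this sketch only.
interface VectorFormatStub {
    String describe();
}

// Step 3: HNSW parameters are constructor arguments of the format
// instead of FieldType attributes.
final class HnswVectorFormatSketch implements VectorFormatStub {
    final int maxConn;
    final int beamWidth;

    HnswVectorFormatSketch(int maxConn, int beamWidth) {
        if (maxConn <= 0 || beamWidth <= 0) {
            throw new IllegalArgumentException("maxConn and beamWidth must be positive");
        }
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    public String describe() {
        return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
}

// Step 4: a per-field wrapper picks a format (and so an algorithm and its
// parameters) by field name, falling back to a default.
final class PerFieldVectorFormatSketch {
    private final VectorFormatStub defaultFormat;
    private final Map<String, VectorFormatStub> perField = new HashMap<>();

    PerFieldVectorFormatSketch(VectorFormatStub defaultFormat) {
        this.defaultFormat = defaultFormat;
    }

    void setFormat(String field, VectorFormatStub format) {
        perField.put(field, format);
    }

    VectorFormatStub getFormatForField(String field) {
        return perField.getOrDefault(field, defaultFormat);
    }
}
```

In this arrangement a field's type would carry only its dimension and a 
{{SimilarityFunction}}, while tuning values such as {{maxConn}} and 
{{beamWidth}} live entirely on the format chosen for that field.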

One note: the current HNSW-based format includes logic for storing a numeric 
vector per document, as well as for constructing and storing an HNSW graph. 
When adding another implementation, it’d be nice to be able to reuse the logic 
for reading/writing numeric vectors. I don’t think we need to design for this 
right now, but we can keep it in mind for the future?

This issue is based on a thread [~jpountz] started: 
[https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

