[jira] [Updated] (LUCENE-9905) Revise approach to specifying NN algorithm

Julie Tibshirani (Jira) Mon, 05 Apr 2021 17:45:05 -0700


     [ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Julie Tibshirani updated LUCENE-9905:
-------------------------------------
    Description: 
In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
particular nearest-neighbor search data structure and algorithm. This 
flexibility is important since NN search is a developing area and we’d like to 
be able to experiment with and evolve the algorithm. Right now we only have one 
algorithm (HNSW), but we want to maintain the ability to use another.

Currently the algorithm to use is specified through {{SearchStrategy}}, for 
example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is 
expected to handle multiple algorithms. Instead we could have one format 
implementation per algorithm. Our current implementation would be HNSW-specific 
like {{HnswVectorFormat}}, and to experiment with another algorithm you could 
create a new implementation like {{ClusterVectorFormat}}. This would be better 
aligned with the codec framework, and help avoid exposing algorithm details in 
the API.
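
To make the "one format implementation per algorithm" idea concrete, here is a minimal sketch. Every type in it is a hypothetical stand-in written for this illustration; none of it is Lucene's actual codec API, and the {{algorithm()}} method is an assumption added only so the example has something to print.

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; not Lucene's actual classes.
interface VectorFormat {
    /** Name of the nearest-neighbor algorithm this format implements. */
    String algorithm();
}

/** The current implementation, made explicitly HNSW-specific. */
final class HnswVectorFormat implements VectorFormat {
    @Override
    public String algorithm() {
        return "HNSW";
    }
}

/** An experimental alternative lives in its own format class. */
final class ClusterVectorFormat implements VectorFormat {
    @Override
    public String algorithm() {
        return "cluster";
    }
}

class FormatPerAlgorithmDemo {
    public static void main(String[] args) {
        // Callers pick an algorithm by picking a format; no algorithm name
        // appears in a search-strategy enum.
        List<VectorFormat> formats =
            List.of(new HnswVectorFormat(), new ClusterVectorFormat());
        for (VectorFormat format : formats) {
            System.out.println(format.algorithm());
        }
    }
}
```

With this shape, adding a new algorithm means adding a new class rather than a new enum constant that every format has to understand.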

A concrete proposal (note many of these names will change when LUCENE-9855 is 
addressed):
# Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
HNSW to the names of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
# Remove references to HNSW in {{SearchStrategy}}, so there is just 
{{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like 
{{SimilarityFunction}}.
# Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
# Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
parameters to be configured per-field (?)
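
A rough sketch of how items 2 to 4 of the proposal might fit together. This is not the real Lucene code: the types reuse the proposed names as simplified stand-ins, and the {{compare}}, {{describe}}, {{setFormat}} and {{getFormatForField}} methods, the field name, and the parameter values 16/100 and 32/200 are all assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// All types below are hypothetical illustrations, not Lucene's actual API.

/** Item 2: a similarity function that no longer mentions the algorithm. */
enum SimilarityFunction {
    EUCLIDEAN {
        @Override
        float compare(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                float d = a[i] - b[i];
                sum += d * d;
            }
            return sum; // squared distance; lower means closer
        }
    },
    DOT_PRODUCT {
        @Override
        float compare(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum; // higher means more similar
        }
    };

    abstract float compare(float[] a, float[] b);
}

interface VectorFormat {
    String describe();
}

/** Item 3: HNSW parameters are format arguments, not FieldType attributes. */
final class HnswVectorFormat implements VectorFormat {
    private final int maxConn;
    private final int beamWidth;

    HnswVectorFormat(int maxConn, int beamWidth) {
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    @Override
    public String describe() {
        return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
}

/** Item 4: choose a format (and so an algorithm and parameters) per field. */
final class PerFieldVectorFormat {
    private final Map<String, VectorFormat> byField = new HashMap<>();
    private final VectorFormat defaultFormat;

    PerFieldVectorFormat(VectorFormat defaultFormat) {
        this.defaultFormat = defaultFormat;
    }

    void setFormat(String field, VectorFormat format) {
        byField.put(field, format);
    }

    VectorFormat getFormatForField(String field) {
        return byField.getOrDefault(field, defaultFormat);
    }
}

class ProposalSketchDemo {
    public static void main(String[] args) {
        PerFieldVectorFormat perField =
            new PerFieldVectorFormat(new HnswVectorFormat(16, 100));
        perField.setFormat("title_vector", new HnswVectorFormat(32, 200));
        System.out.println(perField.getFormatForField("title_vector").describe());
        System.out.println(perField.getFormatForField("body_vector").describe());
    }
}
```

The point of the split is that {{SimilarityFunction}} describes only how two vectors are compared, while everything algorithm-specific (here maxConn and beamWidth) stays behind the format.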

One note: the current HNSW-based format includes logic for storing a numeric 
vector per document, as well as constructing and storing an HNSW graph. When 
adding another implementation, it’d be nice to be able to reuse the logic for 
reading/writing numeric vectors. I don’t think we need to design for this 
right now, but we can keep it in mind for the future?

This issue is based on a thread [~jpountz] started: 
[https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]


> Revise approach to specifying NN algorithm
> ------------------------------------------
>
>                 Key: LUCENE-9905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9905
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Major
>


