[
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julie Tibshirani updated LUCENE-9905:
-------------------------------------
Description:
In LUCENE-9322 we decided that the new vectors API shouldn’t assume a
particular nearest-neighbor search data structure and algorithm. This
flexibility is important since NN search is a developing area and we'd like to
be able to experiment and evolve the algorithm. Right now we only have one
algorithm (HNSW), but we want to maintain the ability to use another.
Currently the algorithm to use is specified through {{SearchStrategy}}, for
example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation is
expected to handle multiple algorithms. Instead we could have one format
implementation per algorithm. Our current implementation would be HNSW-specific
like {{HnswVectorFormat}}, and to experiment with another algorithm you could
create a new implementation like {{ClusterVectorFormat}}. This would be better
aligned with the codec framework, and help avoid exposing algorithm details in
the API.
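To make the "one format implementation per algorithm" idea concrete, here is a minimal sketch. All type names in it ({{VectorFormatSketch}}, {{HnswVectorFormatSketch}}, {{ClusterVectorFormatSketch}}, {{VectorFormatRegistrySketch}}) are hypothetical stand-ins rather than Lucene classes, and the name-based lookup is only an assumption about how a codec might select a format:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a codec-level vector format; not a Lucene class.
interface VectorFormatSketch {
    String name();       // name the format is registered and looked up under
    String algorithm();  // the single NN algorithm this format implements
}

// One implementation per algorithm: this one would own all HNSW details.
final class HnswVectorFormatSketch implements VectorFormatSketch {
    @Override public String name() { return "HnswVectorFormat"; }
    @Override public String algorithm() { return "HNSW graph"; }
}

// An experimental alternative lives in its own implementation.
final class ClusterVectorFormatSketch implements VectorFormatSketch {
    @Override public String name() { return "ClusterVectorFormat"; }
    @Override public String algorithm() { return "cluster-based index"; }
}

// Assumed lookup mechanism: formats are resolved by name, so the public API
// never needs an enum constant that mentions a specific algorithm.
final class VectorFormatRegistrySketch {
    private final Map<String, VectorFormatSketch> byName = new LinkedHashMap<>();

    void register(VectorFormatSketch format) {
        byName.put(format.name(), format);
    }

    VectorFormatSketch forName(String name) {
        VectorFormatSketch format = byName.get(name);
        if (format == null) {
            throw new IllegalArgumentException("Unknown vector format: " + name);
        }
        return format;
    }
}
```

In this arrangement, trying a new algorithm means registering another implementation, and nothing in the API or the existing formats has to change.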
A concrete proposal (note many of these names will change when LUCENE-9855 is
addressed):
# Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add
HNSW to the names of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
# Remove references to HNSW in {{SearchStrategy}}, so there is just
{{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something like
{{SimilarityFunction}}.
# Remove {{FieldType}} attributes related to HNSW parameters (maxConn and
beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
# Introduce {{PerFieldVectorFormat}} to allow a different NN approach or
parameters to be configured per-field (?)
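Items 2 to 4 of the proposal could look roughly like the sketch below. It is only an illustration under stated assumptions: every type name ending in {{Sketch}} is hypothetical, {{DOT_PRODUCT}} is an assumed second similarity alongside {{EUCLIDEAN}}, and the per-field lookup with a default fallback is a guess at how {{PerFieldVectorFormat}} might behave.

```java
import java.util.HashMap;
import java.util.Map;

// Item 2 (hypothetical): the similarity says how to compare two vectors and
// nothing about which index structure is used.
enum SimilarityFunctionSketch {
    EUCLIDEAN {
        @Override
        float compare(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                float diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum; // squared Euclidean distance; lower means closer
        }
    },
    DOT_PRODUCT { // assumed second option, not named in the proposal
        @Override
        float compare(float[] a, float[] b) {
            float sum = 0f;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];
            }
            return sum; // higher means more similar
        }
    };

    abstract float compare(float[] a, float[] b);
}

// Hypothetical common type for per-algorithm formats.
interface KnnFormatSketch {
    String describe();
}

// Item 3 (hypothetical): HNSW parameters are constructor arguments of the
// HNSW-specific format rather than FieldType attributes.
final class Lucene90HnswVectorFormatSketch implements KnnFormatSketch {
    final int maxConn;
    final int beamWidth;

    Lucene90HnswVectorFormatSketch(int maxConn, int beamWidth) {
        if (maxConn <= 0 || beamWidth <= 0) {
            throw new IllegalArgumentException("maxConn and beamWidth must be positive");
        }
        this.maxConn = maxConn;
        this.beamWidth = beamWidth;
    }

    @Override
    public String describe() {
        return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
}

// Item 4 (hypothetical): choose a format, and therefore an algorithm and its
// parameters, per field, falling back to a default.
final class PerFieldVectorFormatSketch {
    private final Map<String, KnnFormatSketch> perField = new HashMap<>();
    private final KnnFormatSketch defaultFormat;

    PerFieldVectorFormatSketch(KnnFormatSketch defaultFormat) {
        this.defaultFormat = defaultFormat;
    }

    void setFormatForField(String field, KnnFormatSketch format) {
        perField.put(field, format);
    }

    KnnFormatSketch getFormatForField(String field) {
        return perField.getOrDefault(field, defaultFormat);
    }
}
```

The point of the split is that the field only needs to declare a similarity, while tuning knobs such as maxConn and beamWidth stay with the format that understands them.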
One note: the current HNSW-based format includes logic for storing a numeric
vector per document, as well as constructing and storing an HNSW graph. When
adding another implementation, it’d be nice to be able to reuse the logic for
reading/writing numeric vectors. I don’t think we need to design for this
right now, but we can keep it in mind for the future?
This issue is based on a thread [~jpountz] started:
[https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
> Revise approach to specifying NN algorithm
> ------------------------------------------
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Major