[ https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359411#comment-17359411 ]
ASF subversion and git services commented on LUCENE-9905:
---------------------------------------------------------
Commit e9339253f5ebcd88282297bdadcbe1705e15f91b in lucene's branch
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e933925 ]
LUCENE-9905: Make sure to use configured vector format when merging (#176)
Previously, when creating a VectorWriter for merging, we would always load the
default implementation, so any parameters configured on the format were
ignored.
This issue was caught by `TestKnnGraph#testMergeProducesSameGraph`.
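
To illustrate the behavior this commit message describes, here is a minimal, self-contained sketch. It does not use Lucene classes: `VectorFormat`, `Codec`, `formatForMergeBefore` and `formatForMergeAfter` are hypothetical stand-ins, and the default parameter values are arbitrary placeholders rather than Lucene's actual defaults.

```java
// Hypothetical sketch, not Lucene code: all names below are illustrative.
class MergeFormatSketch {

    /** Stand-in for a vector format that carries HNSW parameters. */
    static final class VectorFormat {
        // Placeholder defaults, chosen only for this example.
        static final VectorFormat DEFAULT = new VectorFormat(16, 16);

        final int maxConn;
        final int beamWidth;

        VectorFormat(int maxConn, int beamWidth) {
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }

        @Override
        public String toString() {
            return "VectorFormat(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
        }
    }

    /** Stand-in for a codec that was configured with a specific vector format. */
    static final class Codec {
        final VectorFormat vectorFormat;

        Codec(VectorFormat vectorFormat) {
            this.vectorFormat = vectorFormat;
        }
    }

    /** The problem as described: the merge path always picks the default format. */
    static VectorFormat formatForMergeBefore(Codec codec) {
        return VectorFormat.DEFAULT; // configured parameters are ignored
    }

    /** The fix as described: the merge path asks the codec for its configured format. */
    static VectorFormat formatForMergeAfter(Codec codec) {
        return codec.vectorFormat;
    }

    public static void main(String[] args) {
        Codec codec = new Codec(new VectorFormat(32, 200));
        System.out.println("before: " + formatForMergeBefore(codec));
        System.out.println("after:  " + formatForMergeAfter(codec));
    }
}
```

With the "before" variant, a graph built during merge would use different parameters than one built at flush time, which is the kind of mismatch a test comparing the two graphs would surface.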
> Revise approach to specifying NN algorithm
> ------------------------------------------
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: main (9.0)
> Reporter: Julie Tibshirani
> Priority: Blocker
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a
> particular nearest-neighbor search data structure and algorithm. This
> flexibility is important since NN search is a developing area and we'd like
> to be able to experiment and evolve the algorithm. Right now we only have one
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation
> is expected to handle multiple algorithms. Instead we could have one format
> implementation per algorithm. Our current implementation would be
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another
> algorithm you could create a new implementation like {{ClusterVectorFormat}}.
> This would be better aligned with the codec framework, and help avoid
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add
> HNSW to the names of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or
> parameters to be configured per-field (?)
> One note: the current HNSW-based format includes logic for storing a numeric
> vector per document, as well as constructing and storing an HNSW graph. When
> adding another implementation, it’d be nice to be able to reuse the logic for
> reading/writing numeric vectors. I don’t think we need to design for this
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started:
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
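
As a rough illustration of items 3 and 4 in the proposal quoted above, the sketch below moves the HNSW parameters onto the format itself and selects a format per field. It is self-contained and does not use Lucene classes: the interface, the `with` and `formatFor` methods, the `numClusters` parameter and the field names are all assumptions made for the example, and only the names `HnswVectorFormat`, `ClusterVectorFormat` and `PerFieldVectorFormat` are borrowed from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Lucene code: shapes and signatures are illustrative.
class PerFieldFormatSketch {

    /** One format implementation per algorithm. */
    interface VectorFormat {
        String describe();
    }

    /** HNSW-specific format: maxConn and beamWidth are constructor arguments. */
    static final class HnswVectorFormat implements VectorFormat {
        final int maxConn;
        final int beamWidth;

        HnswVectorFormat(int maxConn, int beamWidth) {
            this.maxConn = maxConn;
            this.beamWidth = beamWidth;
        }

        @Override
        public String describe() {
            return "hnsw(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
        }
    }

    /** A second, experimental algorithm; numClusters is an invented parameter. */
    static final class ClusterVectorFormat implements VectorFormat {
        final int numClusters;

        ClusterVectorFormat(int numClusters) {
            this.numClusters = numClusters;
        }

        @Override
        public String describe() {
            return "cluster(numClusters=" + numClusters + ")";
        }
    }

    /** Chooses a format by field name, falling back to a default format. */
    static final class PerFieldVectorFormat {
        private final VectorFormat defaultFormat;
        private final Map<String, VectorFormat> overrides = new HashMap<>();

        PerFieldVectorFormat(VectorFormat defaultFormat) {
            this.defaultFormat = defaultFormat;
        }

        PerFieldVectorFormat with(String field, VectorFormat format) {
            overrides.put(field, format);
            return this;
        }

        VectorFormat formatFor(String field) {
            return overrides.getOrDefault(field, defaultFormat);
        }
    }

    public static void main(String[] args) {
        PerFieldVectorFormat perField =
            new PerFieldVectorFormat(new HnswVectorFormat(16, 16))
                .with("title_vector", new HnswVectorFormat(32, 200))
                .with("image_vector", new ClusterVectorFormat(256));

        for (String field : new String[] {"title_vector", "image_vector", "body_vector"}) {
            System.out.println(field + " -> " + perField.formatFor(field).describe());
        }
    }
}
```

In this arrangement the field type no longer needs to know about maxConn or beamWidth, and swapping the algorithm for one field does not touch the others.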