[ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161697#comment-17161697
 ] 

Julie Tibshirani commented on LUCENE-9322:
------------------------------------------

Hello everyone, I'm sorry for the very late response. Thank you for your 
comments on the proposal! And thanks [~alexklibisz] for the suggestions; I will 
take a look at the link.
{quote}Personally I would prefer an unified file format for vectors since it is 
(theoretically) independent from higher level ANN algorithms. Could we expose 
just one "Lucene90VectorsFormat" and low-level I/O, and make only higher logic 
(o.a.l.a.index/document/search) to be customizable? Forward iteration is 
encouraged anyway... 
{quote}
I'm not sure how we could have a completely unified `VectorsFormat`, because 
different ANN algorithms require building and maintaining customized data 
structures such as nearest-neighbor graphs. However, it would be great to share 
the logic for writing/reading the original vectors if possible.
{quote}What about different distance metrics like angular and L1 distance? JFYI 
I previously implemented switchable distance function on the HNSW branch, if 
you have not noticed it…
{quote}
I have the same intuition as Mayya that it’s nice to keep the design simple at 
first and just use Euclidean distance in the first iteration. It’s possible to 
rank by angular distance using Euclidean distance by first normalizing 
the document and query vectors to unit length: for unit vectors, the squared 
Euclidean distance equals 2 - 2 cos(angle), so both orderings agree. However, I 
could certainly see support for maximum inner product search being useful in 
the future. 
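As a rough illustration of the normalization point, here is a minimal Java sketch. The class and method names (`AngularViaEuclidean`, `normalize`, `rankByNormalizedEuclidean`) are made up for this example and are not part of Lucene or of either prototype. It scales the query and document vectors to unit length and then ranks by squared Euclidean distance, which for unit vectors equals 2 - 2 * cosine and so orders documents by angle.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only: class and method names are hypothetical, not Lucene APIs.
final class AngularViaEuclidean {

    /** Returns a copy of v scaled to unit L2 length; rejects the zero vector. */
    static float[] normalize(float[] v) {
        double sumSquares = 0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        if (sumSquares == 0) {
            throw new IllegalArgumentException("zero vector has no direction");
        }
        double norm = Math.sqrt(sumSquares);
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / norm);
        }
        return unit;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = (double) a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Cosine of the angle between two non-zero vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Document indices ordered from smallest to largest angle to the query. */
    static Integer[] rankByNormalizedEuclidean(float[] query, float[][] docs) {
        float[] q = normalize(query);
        double[] dist = new double[docs.length];
        Integer[] order = new Integer[docs.length];
        for (int i = 0; i < docs.length; i++) {
            dist[i] = squaredEuclidean(q, normalize(docs[i]));
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i]));
        return order;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        float[][] docs = {{10f, 1f}, {0.5f, 0.5f}, {-1f, 0f}, {0f, 2f}};
        System.out.println(Arrays.toString(rankByNormalizedEuclidean(query, docs)));
    }
}
```

In the sample data, `docs[0]` has the smallest angle to the query but a large magnitude, so plain Euclidean distance on the raw vectors would rank it last; after normalization it ranks first. In an index, the documents would presumably be normalized once at write time rather than per query as done here.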
{quote}Query part would also need some abstraction and there are many things to 
be well thought..., so could we discuss about it in another dedicated issue, to 
keep the scope here small ?
{quote}
Right, perhaps we can focus on moving the current proposal forward before 
nailing down how it will integrate with `Query`. It will be an interesting 
follow-up discussion!
{quote}How would we feel to break this part and commit it separately ? 
{quote}
Personally I would be okay with committing basic vector support first, but with 
solid APIs/plugin points for ANN as well. My motivation for considering both 
vectors and ANN together was to make sure the APIs + codec design could 
accommodate all the functionality we think is important.
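
To make the "basic vector support plus plug-in points" idea concrete, here is a small Java sketch under stated assumptions. Every name in it (`VectorReader`, `NearestNeighborSearcher`, `InMemoryVectors`, `BruteForceSearcher`) is hypothetical and is not the API proposed in this issue or in either prototype. It separates reading stored vectors from the search strategy, and uses an exact brute-force scan as a stand-in where an ANN implementation (HNSW, coarse quantization) would plug in.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical read access to stored vectors, independent of any ANN algorithm. */
interface VectorReader {
    /** Returns the vector for docId, or null if the document has none. */
    float[] vector(int docId);

    /** Doc IDs that have a vector, in increasing order (forward iteration). */
    Iterable<Integer> docIds();
}

/** Hypothetical plug-in point: each ANN strategy would supply its own implementation. */
interface NearestNeighborSearcher {
    /** Returns up to k doc IDs, closest first. */
    List<Integer> search(float[] query, int k);
}

/** Toy in-memory store standing in for a codec-backed format. */
final class InMemoryVectors implements VectorReader {
    private final Map<Integer, float[]> vectors = new TreeMap<>();

    void add(int docId, float[] vector) {
        vectors.put(docId, vector.clone());
    }

    @Override
    public float[] vector(int docId) {
        float[] v = vectors.get(docId);
        return v == null ? null : v.clone();
    }

    @Override
    public Iterable<Integer> docIds() {
        return vectors.keySet();
    }
}

/** Exact brute-force baseline (not approximate); an ANN searcher would replace this. */
final class BruteForceSearcher implements NearestNeighborSearcher {
    private final VectorReader reader;

    BruteForceSearcher(VectorReader reader) {
        this.reader = reader;
    }

    @Override
    public List<Integer> search(float[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int id : reader.docIds()) {
            ids.add(id);
        }
        // Sort all candidates by squared Euclidean distance to the query.
        ids.sort(Comparator.comparingDouble(
                (Integer id) -> squaredDistance(query, reader.vector(id))));
        int limit = Math.max(0, Math.min(k, ids.size()));
        return new ArrayList<>(ids.subList(0, limit));
    }

    private static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = (double) a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

The point of the split is only that code storing and retrieving raw vectors does not need to know which search structure sits on top of it; how that maps onto actual codec formats is exactly what is still under discussion here.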

> Discussing a unified vectors format API
> ---------------------------------------
>
>                 Key: LUCENE-9322
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Julie Tibshirani
>            Priority: Major
>
> Two different approximate nearest neighbor approaches are currently being 
> developed, one based on HNSW (LUCENE-9004) and another based on coarse 
> quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to 
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API 
> that could support both approaches. The two ANN strategies give different 
> trade-offs in terms of speed, memory, and complexity, and it’s likely that 
> we’ll want to support both. Vector search is also an active research area, 
> and it would be great to be able to prototype and incorporate new approaches 
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The 
> prototype for coarse quantization 
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit 
> soon (this depends on everyone's feedback of course). The approach is simple 
> and shows solid search performance, as seen 
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
> I think this API discussion is an important step in moving that 
> implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector, 
> return the indexed vectors that are closest to it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
