[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

Julie Tibshirani (Jira) Sun, 25 Oct 2020 14:48:01 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220400#comment-17220400
 ]


Julie Tibshirani commented on LUCENE-9583:
------------------------------------------

> Without it, we would need to load all vectors into RAM while 
> flushing/merging, as we currently do in BinaryDocValuesWriter.BinaryDVs. I 
> wonder if it's worth paying this cost for the simpler API.

This made me notice that we ignore vectors in {{SortingCodecReader}}, which can 
be used to sort an index after it's already been created. I opened 
https://github.com/apache/lucene-solr/pull/2028 to address this.

I'm not an expert in this code, but to me the trade-off seems worth it for a 
well-scoped API. Having a tighter set of methods makes it clear to callers how 
vectors are intended to be used: for retrieving docs through kNN or as a 
contributor to document scores. There could also still be room for future 
optimizations that avoid reloading the vectors. For example, when flushing we 
always work with {{BufferedVectorValues}} -- maybe {{SortingVectorValues}} 
could take that in directly.
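
As a rough illustration of that last idea, here is a minimal sketch of a
forward-only sorted view built directly on top of an in-memory buffer. It is
not Lucene code: the class names ({{BufferedVectors}}, {{SortedVectors}}), the
{{oldToNew}} array and all method signatures are assumptions made up for this
example, and the real {{BufferedVectorValues}} / {{SortingVectorValues}} may
look quite different. The buffer already holds the vectors in memory, so the
view only sorts ordinals and never copies or reloads a vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these types are part of Lucene's API.
final class SortingVectorsSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Stand-in for vectors buffered in RAM while indexing, keyed by ordinal. */
    static final class BufferedVectors {
        private final List<Integer> docIds = new ArrayList<>();
        private final List<float[]> vectors = new ArrayList<>();

        void add(int docId, float[] vector) {
            docIds.add(docId);
            vectors.add(vector.clone());
        }

        int size() { return docIds.size(); }
        int docId(int ord) { return docIds.get(ord); }
        float[] vector(int ord) { return vectors.get(ord); }
    }

    /** Forward-only iteration over the buffer in the new (sorted) doc order. */
    static final class SortedVectors {
        private final BufferedVectors buffer;
        private final int[] oldToNew; // assumed mapping: oldToNew[oldDoc] == newDoc
        private final Integer[] ords; // buffer ordinals, ordered by new doc id
        private int pos = -1;

        SortedVectors(BufferedVectors buffer, int[] oldToNew) {
            this.buffer = buffer;
            this.oldToNew = oldToNew;
            this.ords = new Integer[buffer.size()];
            for (int i = 0; i < ords.length; i++) {
                ords[i] = i;
            }
            // Sort ordinals only; the vectors themselves stay where they are.
            Arrays.sort(ords,
                Comparator.comparingInt((Integer ord) -> oldToNew[buffer.docId(ord)]));
        }

        /** Advances to the next doc in sorted order and returns its new doc id. */
        int nextDoc() {
            pos++;
            return pos < ords.length ? oldToNew[buffer.docId(ords[pos])] : NO_MORE_DOCS;
        }

        /** Vector of the current doc; valid only after a successful nextDoc(). */
        float[] vectorValue() {
            return buffer.vector(ords[pos]);
        }
    }

    public static void main(String[] args) {
        BufferedVectors buffer = new BufferedVectors();
        buffer.add(0, new float[] {1f, 0f});
        buffer.add(2, new float[] {0f, 1f});
        buffer.add(3, new float[] {1f, 1f});

        int[] oldToNew = {3, 2, 1, 0}; // reverses a four-document segment
        SortedVectors sorted = new SortedVectors(buffer, oldToNew);
        for (int doc = sorted.nextDoc(); doc != NO_MORE_DOCS; doc = sorted.nextDoc()) {
            System.out.println(doc + " -> " + Arrays.toString(sorted.vectorValue()));
        }
    }
}
```

Callers of such a view would only ever see {{nextDoc()}} and
{{vectorValue()}}, so no ordinals or random-access methods would need to leak
into the public API for the flush case.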

> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
> to {{VectorValues.RandomAccess}}. This I think we could move back, and handle 
> the HNSW requirements for search elsewhere.

This seems like a nice change, no matter what happens with {{RandomAccess}}!

> How should we expose VectorValues.RandomAccess?
> -----------------------------------------------
>
>                 Key: LUCENE-9583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9583
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} 
> sub-interface. [~jtibshirani] pointed out this is not needed by some 
> vector-indexing strategies which can operate solely using a forward-iterator 
> (it is needed by HNSW), and so in the interest of simplifying the public API 
> we should not expose this internal detail (which by the way surfaces internal 
> ordinals that are somewhat uninteresting outside the random access API).
> I looked into how to move this inside the HNSW-specific code and remembered 
> that we do also currently make use of the RA API when merging vector fields 
> over sorted indexes. Without it, we would need to load all vectors into RAM  
> while flushing/merging, as we currently do in 
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost 
> for the simpler API.
> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
> to {{VectorValues.RandomAccess}}. This I think we could move back, and 
> handle the HNSW requirements for search elsewhere. I wonder if that would 
> alleviate the major concern here? 



