[ https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220400#comment-17220400 ]
Julie Tibshirani commented on LUCENE-9583:
------------------------------------------

> Without it, we would need to load all vectors into RAM while
> flushing/merging, as we currently do in BinaryDocValuesWriter.BinaryDVs. I
> wonder if it's worth paying this cost for the simpler API.

This made me notice that we ignore vectors in {{SortingCodecReader}}, which can be used to sort an index after it's already been created. I opened https://github.com/apache/lucene-solr/pull/2028 to address this.

I'm not an expert in this code, but to me the trade-off seems worth it for a well-scoped API. Having a tighter set of methods makes it clear to callers how vectors are intended to be used: for retrieving docs through kNN or as a contributor to document scores. And there could still be room for future optimizations to avoid reloading the vectors? For example, when flushing we always work with {{BufferedVectorValues}} -- maybe {{SortingVectorValues}} could take that in directly.

> Another thing I noticed while reviewing this is that I moved the KNN
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}}
> to {{VectorValues.RandomAccess}}. This I think we could move back, and handle
> the HNSW requirements for search elsewhere.

This seems like a nice change, no matter what happens with {{RandomAccess}}!

> How should we expose VectorValues.RandomAccess?
> -----------------------------------------------
>
>                 Key: LUCENE-9583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9583
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}}
> sub-interface. [~jtibshirani] pointed out this is not needed by some
> vector-indexing strategies which can operate solely using a forward-iterator
> (it is needed by HNSW), and so in the interest of simplifying the public API
> we should not expose this internal detail (which by the way surfaces internal
> ordinals that are somewhat uninteresting outside the random access API).
>
> I looked into how to move this inside the HNSW-specific code and remembered
> that we do also currently make use of the RA API when merging vector fields
> over sorted indexes. Without it, we would need to load all vectors into RAM
> while flushing/merging, as we currently do in
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost
> for the simpler API.
>
> Another thing I noticed while reviewing this is that I moved the KNN
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}}
> to {{VectorValues.RandomAccess}}. This I think we could move back, and
> handle the HNSW requirements for search elsewhere. I wonder if that would
> alleviate the major concern here?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org