kaivalnp commented on PR #12820: URL: https://github.com/apache/lucene/pull/12820#issuecomment-1816720340
Thanks @jpountz! I realised something from your comment: My current implementation has a flaw, because it cannot handle the [`OrdinalTranslatedKnnCollector`](https://github.com/kaivalnp/lucene/blob/03624754eb3ccf9da114ff5fc358cc230466ea3f/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L232) correctly: The [`setVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L253) call has the `BitSet visited` as packed ordinals, but the [`getVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L91) call receives a `docId` (and not a `vectorId`) so we would need a reverse [`IntToIntFunction docIdToVectorOrdinal`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L31) to map it back to an ordinal This is straightforward for `DenseOffHeapVectorValues` or `EmptyOffHeapVectorValues` (because there is a 1-1 mapping between a doc and ordinal) but becomes a problem for `SparseOffHeapVectorValues` which has the `vectorOrdinaltoDocId` implemented as a [`DirectMonotonicReader`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java#L139) - which I think is docIds stored one after another - so getting the docId for an ordinal is as simple as a lookup at that offset. However, getting an inverse of this can become costly (binary search -> returning the index) as opposed to the current constant time lookup I wonder how costly it would be to maintain the set of visited docs at the `KnnCollector` like you mentioned (perhaps using a `SparseFixedBitSet`)? We [already create a `BitSet`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L121) of `maxDoc` length to hold the filtered docs.. In the worst case, we would need another `BitSet` of the same length to store which docs are visited from graph search, then skip over those from `#exactSearch`. However, there may be a better opportunity here: since we want to go over docs that are "prefiltered but not visited", can we simply `#clear` the bits whenever we visit a node - we just need to find a way to do this cleanly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org