kaivalnp commented on PR #12820:
URL: https://github.com/apache/lucene/pull/12820#issuecomment-1816720340

   Thanks @jpountz! I realised something from your comment:
   
   My current implementation has a flaw, because it cannot handle the 
[`OrdinalTranslatedKnnCollector`](https://github.com/kaivalnp/lucene/blob/03624754eb3ccf9da114ff5fc358cc230466ea3f/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L232)
 correctly: The 
[`setVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L253)
 call has the `BitSet visited` as packed ordinals, but the 
[`getVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L91)
 call receives a `docId` (and not a `vectorId`) so we would need a reverse 
[`IntToIntFunction 
docIdToVectorOrdinal`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L31)
 to map it back to an ordinal
   
   This is straightforward for `DenseOffHeapVectorValues` or 
`EmptyOffHeapVectorValues` (because there is a 1-1 mapping between a doc and 
ordinal) but becomes a problem for `SparseOffHeapVectorValues` which has the 
`vectorOrdinaltoDocId` implemented as a 
[`DirectMonotonicReader`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java#L139)
 - which I think is docIds stored one after another - so getting the docId for 
an ordinal is as simple as a lookup at that offset. However, getting an inverse 
of this can become costly (binary search -> returning the index) as opposed to 
the current constant time lookup
   
   I wonder how costly it would be to maintain the set of visited docs at the 
`KnnCollector` like you mentioned (perhaps using a `SparseFixedBitSet`)? We 
[already create a 
`BitSet`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L121)
 of `maxDoc` length to hold the filtered docs..
   
   In the worst case, we would need another `BitSet` of the same length to 
store which docs are visited from graph search, then skip over those from 
`#exactSearch`. However, there may be a better opportunity here: since we want 
to go over docs that are "prefiltered but not visited", can we simply `#clear` 
the bits whenever we visit a node - we just need to find a way to do this 
cleanly?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to