kaivalnp commented on issue #12955:
URL: https://github.com/apache/lucene/issues/12955#issuecomment-1868306253

   This test 
[checks](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L398-L406)
 whether the `FloatVectorSimilarityQuery` can surface *all* vectors above a 
[`resultSimilarity`](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L376).
   
   The test fails because it expects 45 results but finds only 44. This should 
ideally not be possible, because we set a 
[`traversalSimilarity` of 
`Float.NEGATIVE_INFINITY`](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L392)
 -- so every node in the HNSW graph with a score above `-Infinity` is 
traversed (which is every node in the graph).
   
   I suspect this has something to do with a disconnected graph, where one of 
the nodes is a valid result but not reachable. To demonstrate this, I wrote a 
[snippet](https://github.com/apache/lucene/commit/edbbfd6d55fd1cd1cfb14ed558945660fa4c0690)
 that calculates the nodes reachable from the entry point, the doc that was 
missed from results, and the doc and score of unreachable nodes. Here is the 
result (using the repro command above):
   
   ```
     1> Similarity = 0.097851
     1> Total Nodes = 131, Reachable Nodes = 129
     1> 
     1> Unreachable Nodes:
     1> Doc = 76, Score = 0.103340
     1> Doc = 117, Score = 0.083347
     1> 
     1> Missed Docs = {76=0.103340186}
   ```
   
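   Looks like the missed doc (76) is one of the two unreachable nodes. Its 
score (0.103340) is above the reported similarity (0.097851), so it is a valid 
result, while the other unreachable doc (117) scores below it and is not 
expected in the results anyway.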
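   
   For illustration, the reachability check described above can be sketched as 
a breadth-first traversal from the entry point. This is not the linked snippet 
and does not use Lucene's graph classes: the `int[][]` adjacency array, the 
class name and the method names are all hypothetical, and only a single graph 
layer is modelled.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Hypothetical stand-in for one layer of an HNSW graph (not a Lucene class).
class ReachabilitySketch {

  /**
   * Breadth-first traversal that follows outgoing edges only.
   * neighbors[i] lists the nodes that node i links to (assumed representation).
   */
  public static BitSet reachableFrom(int[][] neighbors, int entryPoint) {
    BitSet visited = new BitSet(neighbors.length);
    Deque<Integer> queue = new ArrayDeque<>();
    visited.set(entryPoint);
    queue.add(entryPoint);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int next : neighbors[node]) {
        if (!visited.get(next)) {
          visited.set(next);
          queue.add(next);
        }
      }
    }
    return visited;
  }

  /** Nodes that no path from the entry point can reach. */
  public static BitSet unreachableFrom(int[][] neighbors, int entryPoint) {
    BitSet unreachable = reachableFrom(neighbors, entryPoint);
    unreachable.flip(0, neighbors.length); // complement within [0, node count)
    return unreachable;
  }

  public static void main(String[] args) {
    // Toy graph: 0 <-> 1 <-> 2 are connected; 3 <-> 4 form an island.
    int[][] neighbors = { {1}, {0, 2}, {1}, {4}, {3} };
    BitSet reachable = reachableFrom(neighbors, 0);
    System.out.println("Total Nodes = " + neighbors.length
        + ", Reachable Nodes = " + reachable.cardinality());
    System.out.println("Unreachable Nodes: " + unreachableFrom(neighbors, 0));
  }
}
```

   Because the traversal follows outgoing edges only, a node that has 
neighbors of its own but no incoming edge from the reachable set still counts 
as unreachable, which is the situation a search starting at the entry point 
would run into.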
   
   I suspect a similar case is possible for KNN search as well (for example 
[this test 
case](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java#L470))
 -- where we index random vectors and search for a random topK. Have we seen 
such failures there?
   
   I also see an open issue for graph disconnectedness: #12627 
   
   As for the fix here, we could try lowering the number of dimensions or the 
number of vectors in the test, but that would only make the failure less 
common until a permanent solution for graph disconnectedness is found?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

