kaivalnp commented on issue #12955: URL: https://github.com/apache/lucene/issues/12955#issuecomment-1868306253

This test [checks](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L398-L406) whether the `FloatVectorSimilarityQuery` can surface *all* vectors above a [`resultSimilarity`](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L376).

The test fails because it expects 45 results but finds only 44. This should ideally not be possible, because we set a [`traversalSimilarity` of `Float.NEGATIVE_INFINITY`](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L392) -- so all nodes in the HNSW graph with a score above `-Infinity` will be traversed (which is all nodes in the HNSW graph).

I suspect this has something to do with a disconnected graph, where one of the nodes is a valid result but is not reachable. To demonstrate this, I wrote a [snippet](https://github.com/apache/lucene/commit/edbbfd6d55fd1cd1cfb14ed558945660fa4c0690) that calculates the nodes reachable from the entry point, the doc that was missed from the results, and the doc and score of each unreachable node.

Here is the result (using the repro command above):

```
  1> Similarity = 0.097851
  1> Total Nodes = 131, Reachable Nodes = 129
  1> 
  1> Unreachable Nodes:
  1> Doc = 76, Score = 0.103340
  1> Doc = 117, Score = 0.083347
  1> 
  1> Missed Docs = {76=0.103340186}
```

Looks like the missed doc is unreachable: doc 76 scores 0.103340, which is above the similarity threshold of 0.097851, but it cannot be reached from the entry point.

I suspect a similar case is possible for KNN search as well (for example [this test case](https://github.com/apache/lucene/blob/dc9f154aa574e8cd0e60070a1814c1d221fbec5d/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java#L470)) -- where we index random vectors and search for a random topK. Not sure if we have seen such failures there?
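
For readers who want the idea without opening the commit, here is a minimal sketch of the kind of reachability check described above. It is not the linked snippet and does not use Lucene's HNSW classes: the class `ReachabilitySketch`, its methods, and the toy graph are all hypothetical, and the graph is assumed to be a single level of directed neighbor lists indexed by node id. It runs a breadth-first search from the entry point and reports the nodes that are never visited.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only: a plain adjacency-list graph, not Lucene's HnswGraph.
public class ReachabilitySketch {

  /** Breadth-first search over directed neighbor lists, starting at entryPoint. */
  public static BitSet reachableFrom(List<int[]> neighbors, int entryPoint) {
    BitSet visited = new BitSet(neighbors.size());
    Deque<Integer> queue = new ArrayDeque<>();
    visited.set(entryPoint);
    queue.add(entryPoint);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int next : neighbors.get(node)) {
        if (!visited.get(next)) {
          visited.set(next);
          queue.add(next);
        }
      }
    }
    return visited;
  }

  /** Nodes that the search from entryPoint never reaches, in ascending order. */
  public static List<Integer> unreachableNodes(List<int[]> neighbors, int entryPoint) {
    BitSet reachable = reachableFrom(neighbors, entryPoint);
    List<Integer> missing = new ArrayList<>();
    for (int node = reachable.nextClearBit(0);
        node < neighbors.size();
        node = reachable.nextClearBit(node + 1)) {
      missing.add(node);
    }
    return missing;
  }

  public static void main(String[] args) {
    // Toy graph: 0 -> 1 -> 2 -> 0. Node 3 links to node 0, but no node links to 3.
    List<int[]> neighbors =
        List.of(new int[] {1}, new int[] {2}, new int[] {0}, new int[] {0});
    System.out.println(unreachableNodes(neighbors, 0)); // prints [3]
  }
}
```

The toy graph also shows why a node can have outgoing links and still be missed: neighbor lists are directed, so what matters is whether any reachable node points at it.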

I also see an open issue for graph disconnectedness: #12627

As for the fix here, we can try something like a lower number of dimensions or a lower number of vectors, but won't that only make the failure less common until a permanent solution is found?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org