jbellis commented on PR #12255: URL: https://github.com/apache/lucene/pull/12255#issuecomment-1541171234
If the algorithm is implemented correctly, and I think that it is, then in theory the order of neighbor traversal should not matter. But we are seeing a difference here, so I *think* the cause is the limited precision that you get in practice when computing vector similarities. If you have enough vectors, and enough dimensions, then the round-off error can accumulate enough to make a difference. That is why the test suite does not surface this difference.

I performed 38 runs of the Texmex SIFT benchmark with known-correct KNN. This resulted in the new code having very slightly better recall on average, with p-value 0.16. My statistics is a bit rusty (it's very rusty) but I believe we're justified in concluding that recall is no worse than before, at least on this test.

Google sheet is [here](https://docs.google.com/spreadsheets/d/1Xcx43x30AmTpm-7GH_SJwkwGQ93P4wYPrClGomebtsk/edit) and raw data is attached as csv. The first column is the new code, and the second is the old (git sha 1fa2be9). [combined.csv](https://github.com/apache/lucene/files/11437240/combined.csv)
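
To make the round-off point concrete, here is a minimal sketch of how a `float` accumulation can depend on the order in which terms are visited. It is not code from Lucene or from this PR: the class `SummationOrderDemo`, its methods, and the input values are hypothetical and were picked by hand to exaggerate the effect. Real similarity computations would show much smaller discrepancies, which could still be enough to reorder near-tied candidates.

```java
// Illustrative sketch only; not code from Lucene or from this PR.
final class SummationOrderDemo {

  // Dot product accumulated in float, visiting components first to last.
  static float dotForward(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Same products, accumulated last to first.
  static float dotReverse(float[] a, float[] b) {
    float sum = 0f;
    for (int i = a.length - 1; i >= 0; i--) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    // Hand-picked values: 1f is smaller than the float spacing near 1e8f (which is 8),
    // so it is lost when added to -1e8f but kept when added to 0f.
    float[] a = {1e8f, -1e8f, 1f};
    float[] b = {1f, 1f, 1f};
    System.out.println("forward = " + dotForward(a, b));
    System.out.println("reverse = " + dotReverse(a, b));
  }
}
```

With these inputs the forward sum is 1.0 and the reverse sum is 0.0: in the reverse order, adding 1 to -1e8 rounds back to -1e8 in `float`, so the 1 is lost before the large terms cancel.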