[GitHub] [lucene] jbellis commented on pull request #12255: allocate one NeighborQueue per search for results

via GitHub Tue, 09 May 2023 18:47:34 -0700


jbellis commented on PR #12255:
URL: https://github.com/apache/lucene/pull/12255#issuecomment-1541171234

If the algorithm is implemented correctly, and I think that it is, then in
theory the order of neighbor traversal should not matter.

But we are seeing a difference here, so I *think* what causes that is the
limited precision that you get in practice when computing vector similarities.
If you have enough vectors, and enough dimensions, then the round-off error can
accumulate enough to make a difference. That is why the test suite does not
surface this difference.
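
To illustrate the general effect (not the actual Lucene code path), here is a minimal sketch showing that a `float` dot product can depend on the order in which the terms are accumulated. The class name `SummationOrder`, the helper `dot`, and the input values are made up for the illustration and are chosen to exaggerate the rounding.

```java
class SummationOrder {
    // Hypothetical helper: accumulates a[i] * b[i] in the index order given by `order`.
    static float dot(float[] a, float[] b, int[] order) {
        float sum = 0f;
        for (int i : order) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Illustrative values only: near 1e8 a float can only represent multiples of 8,
        // so adding 1 to 1e8 is lost to rounding.
        float[] a = {1e8f, -1e8f, 1f};
        float[] b = {1f, 1f, 1f};

        float forward = dot(a, b, new int[] {0, 1, 2});   // (1e8 - 1e8) + 1
        float reordered = dot(a, b, new int[] {0, 2, 1}); // (1e8 + 1) - 1e8

        System.out.println(forward + " vs " + reordered); // prints "1.0 vs 0.0"
    }
}
```

Real similarity computations see far smaller discrepancies than this, but with enough vectors and dimensions a small discrepancy can be enough to flip which of two near-tied candidates is ranked first.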

I performed 38 runs of the Texmex SIFT benchmark with known-correct KNN.
This resulted in the new code having very slightly better recall on average,
with a p-value of 0.16. My statistics is a bit rusty (it's very rusty) but I
believe we're justified in concluding that recall is no worse than before, at
least on this test.
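
The comment does not say which test produced the p-value. As one possibility, assuming the runs are paired (new code vs. old code on the same run), a paired t statistic could be computed roughly like this. The class name `PairedT`, the method `tStatistic`, and the recall values in `main` are hypothetical placeholders, not the data from the attached CSV.

```java
class PairedT {
    // Paired t statistic: mean of per-run differences divided by its standard error.
    // Assumes both arrays have the same length n >= 2 and the differences are not all equal.
    static double tStatistic(double[] newRecall, double[] oldRecall) {
        int n = newRecall.length;
        double mean = 0.0;
        for (int i = 0; i < n; i++) {
            mean += newRecall[i] - oldRecall[i];
        }
        mean /= n;

        double sumSq = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (newRecall[i] - oldRecall[i]) - mean;
            sumSq += d * d;
        }
        double sd = Math.sqrt(sumSq / (n - 1)); // sample standard deviation of the differences
        return mean / (sd / Math.sqrt(n));
    }

    public static void main(String[] args) {
        // Placeholder numbers for illustration only; substitute the two CSV columns.
        double[] newRecall = {0.912, 0.908, 0.915, 0.910};
        double[] oldRecall = {0.911, 0.909, 0.913, 0.910};
        System.out.println("t = " + tStatistic(newRecall, oldRecall));
    }
}
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom (37 for 38 runs) to obtain a p-value; that lookup is left out here because the Java standard library does not provide it.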

Google sheet is
[here](https://docs.google.com/spreadsheets/d/1Xcx43x30AmTpm-7GH_SJwkwGQ93P4wYPrClGomebtsk/edit)
and raw data is attached as csv.

The first column is the new code, and the second is the old (git sha
1fa2be9).

[combined.csv](https://github.com/apache/lucene/files/11437240/combined.csv)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

