jbellis commented on PR #12255:
URL: https://github.com/apache/lucene/pull/12255#issuecomment-1541171234

   If the algorithm is implemented correctly, and I think that it is, then in 
theory the order of neighbor traversal should not matter.
   
   But we are seeing a difference here, so I *think* the cause is the 
limited precision that you get in practice when computing vector similarities.  
If you have enough vectors, and enough dimensions, then the round-off error can 
accumulate enough to make a difference.  That is why the test suite does not 
surface this difference.
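
   To make the precision point concrete, here is a toy sketch (not code from 
this PR; the class and method names are made up for illustration). It only 
shows that a `float` dot product can change when the same terms are 
accumulated in a different order, which is the kind of round-off I mean. It 
does not reproduce the actual graph traversal.

```java
public class DotOrder {
    // Accumulates a float dot product, visiting components in the given order.
    // Hypothetical helper for illustration only; not a Lucene API.
    static float dot(float[] a, float[] b, int[] order) {
        float sum = 0f;
        for (int i : order) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1e8f, 1f, -1e8f};
        float[] b = {1f, 1f, 1f};

        // (1e8 + 1) - 1e8: the 1 is lost, because floats near 1e8 are spaced 8 apart.
        float inOrder = dot(a, b, new int[] {0, 1, 2});
        // (1e8 - 1e8) + 1: the large terms cancel first, so the 1 survives.
        float reordered = dot(a, b, new int[] {0, 2, 1});

        System.out.println(inOrder + " vs " + reordered); // prints "0.0 vs 1.0"
    }
}
```

   Real vectors are nowhere near this extreme, but across many vectors and 
many dimensions small discrepancies of this kind can add up.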
   
   I performed 38 runs of the Texmex SIFT benchmark with known-correct KNN.  
The new code had very slightly better recall on average, with p-value 0.16.  
My statistics are a bit rusty (very rusty, in fact), but I believe we're 
justified in concluding that recall is no worse than before, at least on this 
test.
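
   For anyone who wants to cross-check the p-value without a statistics 
library, below is a minimal sketch of a paired sign-flip permutation test. 
This is not necessarily the test used in the spreadsheet, and it assumes the 
two columns are paired run by run, which may not hold. The class name, method 
name, and parameters are hypothetical.

```java
import java.util.Random;

public class PairedSignFlipTest {
    // Estimates a two-sided p-value for the mean paired difference in recall.
    // Under the null hypothesis (no difference) the sign of each paired
    // difference is arbitrary, so we flip signs at random and count how often
    // the total is at least as extreme as the one actually observed.
    static double pValue(double[] newRecall, double[] oldRecall, int trials, long seed) {
        int n = newRecall.length;
        double[] diff = new double[n];
        double observed = 0;
        for (int i = 0; i < n; i++) {
            diff[i] = newRecall[i] - oldRecall[i];
            observed += diff[i];
        }
        observed = Math.abs(observed);

        Random rnd = new Random(seed);
        int atLeastAsExtreme = 0;
        for (int t = 0; t < trials; t++) {
            double sum = 0;
            for (double d : diff) {
                sum += rnd.nextBoolean() ? d : -d;
            }
            if (Math.abs(sum) >= observed) {
                atLeastAsExtreme++;
            }
        }
        // Add-one correction so the estimate is never exactly zero.
        return (atLeastAsExtreme + 1.0) / (trials + 1.0);
    }
}
```

   Feeding it the two columns of the attached csv should give a rough sanity 
check. A large p-value here only says the data show no detectable difference 
in recall; it does not prove the two versions are equivalent.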
   
   The Google sheet is [here](https://docs.google.com/spreadsheets/d/1Xcx43x30AmTpm-7GH_SJwkwGQ93P4wYPrClGomebtsk/edit) 
and the raw data is attached as csv.
   
   The first column is the new code, and the second is the old (git sha 
1fa2be9).
   
   [combined.csv](https://github.com/apache/lucene/files/11437240/combined.csv)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

