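searchivarius commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1667954978

Hi @benwtrent, first question: do you have your test code publicly available?

> we SHOULD be using euclidean as the similarity comparator.

My tests used "dot-product" (angular).

Second, great that you remembered it, but I think there's no difference between cosine and L2 (i.e., the search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm; I think any constant would suffice:

(squared) L2 = (a-b)^2 = |a|^2 - 2ab + |b|^2 = const1 - cosine_similarity(a,b) * const2

Here const1 = |a|^2 + |b|^2 and const2 = 2|a||b|, both fixed when the norms are constant, so ranking by L2 and ranking by cosine give the same order.

What do you think? cc @jmazanec15

Additional experiments are always good. I have no immediate suggestion for realistic embeddings, though. However, one can try to stress-test the method by synthetically changing the data: multiply or divide vector elements by a random uniform number in the range [1, M]. For large enough M, the transform might become beneficial.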
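
A minimal sketch of the constant-norm argument and the synthetic stress test, in plain Python. It relies on the identity |a-b|^2 = |a|^2 - 2ab + |b|^2, so with fixed norms the squared L2 distance is a decreasing linear function of the cosine. Everything here is an assumption for illustration: the helper names (`random_vector`, `perturb`, `rank_by`), the dimension, the norms 0.5 and 3.0, and M = 4.0 are hypothetical and do not come from the Lucene code or from the experiments discussed in this issue.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def random_vector(rng, dim, length):
    """Random Gaussian direction rescaled to the requested norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = length / norm(v)
    return [x * scale for x in v]


def perturb(rng, v, m):
    """Multiply or divide each element by a uniform random factor in [1, m]."""
    out = []
    for x in v:
        f = rng.uniform(1.0, m)
        out.append(x * f if rng.random() < 0.5 else x / f)
    return out


def rank_by(score, query, docs, reverse):
    """Document indices ordered by score (reverse=True for similarities)."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=reverse)


rng = random.Random(0)
QUERY_NORM, DOC_NORM = 0.5, 3.0  # arbitrary constants, deliberately not 1
docs = [random_vector(rng, 16, DOC_NORM) for _ in range(200)]
query = random_vector(rng, 16, QUERY_NORM)

# With constant norms: squared L2 = const1 - const2 * cosine.
const1 = QUERY_NORM ** 2 + DOC_NORM ** 2
const2 = 2 * QUERY_NORM * DOC_NORM
max_identity_error = max(
    abs(squared_l2(query, d) - (const1 - const2 * cosine(query, d))) for d in docs
)

# Smallest distance first vs. largest cosine first should give the same order.
same_ranking = rank_by(squared_l2, query, docs, False) == rank_by(cosine, query, docs, True)

# Stress test: after perturbation the document norms are no longer constant,
# so the two orderings are free to diverge.
perturbed = [perturb(rng, d, 4.0) for d in docs]
norm_spread = max(norm(d) for d in perturbed) - min(norm(d) for d in perturbed)
top_l2 = set(rank_by(squared_l2, query, perturbed, False)[:10])
top_cos = set(rank_by(cosine, query, perturbed, True)[:10])
top10_overlap = len(top_l2 & top_cos)
```

Inspecting `top10_overlap` for increasing values of M gives a rough picture of how quickly the L2 and cosine result sets drift apart once the norms vary.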
Hi @benwtrent first question: do you have your test code publicly available? >we SHOULD be using euclidean as the similarity comparator. My tests used "dot-product" (angular). Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm, I think any constant would suffice: L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2 What do you think? cc @jmazanec15 Additional experiments are always good. I have no immediate suggestion for realistic embeddings, though. However, one can try to stress test the method by synthetically changing the data: Multiply or divide vector elements by a random uniform number in the range [1, M]. For large enough M, the transform might become beneficial. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org