benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1929537735
> My question is why add this function when it's not that much faster than integer dot product?

Because it provides different scores. Integer dot product measures something different (the angle between vectors) and doesn't work for binary-encoded data (vs. euclidean bit distance). Hamming distance is more like `euclidean`. It is possible to do "hamming distance things" now, if users specifically provide `[0, 1, 0, 1, 1...]` and use `euclidean`, but this has obvious drawbacks (8x more vector operations, and vector dims are 8x bigger). And before you suggest "let's remove `euclidean` then", the two are not compatible unless users provide literal `1s/0s`.

> The issue is that folks just want to add, add, add these functions yet there are no ways to remove any function from this list (they will scream "bwc").

If you are against this & will block it, then we need to provide a clean way for users to introduce their own similarities. I suggested making similarities pluggable in the past, but got shot down.

> A good way to get in a new function would be to actually improve our support o&m by removing a horribly performing one such as cosine first. That way we are actually improving rather than just piling on more code.

If hamming and cosine were comparable, then sure. But they are not. I do agree cosine should probably be removed, not because of hamming distance, but because dot_product exists.
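
To make the 8x point concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names (`HammingVsEuclidean`, `hammingDistance`, `unpackBits`, `squaredEuclidean`) are hypothetical and only illustrate the arithmetic. It assumes bits are packed eight per `byte` and that the euclidean variant is the squared distance, which for literal 0/1 vectors equals the number of differing positions.

```java
final class HammingVsEuclidean {

    // Hamming distance over bit-packed vectors: XOR each byte pair, count set bits.
    static int hammingDistance(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            // Mask to 8 bits because bytes are sign-extended when promoted to int.
            distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return distance;
    }

    // Expand each packed byte into eight literal 0/1 dimensions (MSB first).
    static byte[] unpackBits(byte[] packed) {
        byte[] unpacked = new byte[packed.length * 8];
        for (int i = 0; i < packed.length; i++) {
            for (int bit = 0; bit < 8; bit++) {
                unpacked[i * 8 + bit] = (byte) ((packed[i] >> (7 - bit)) & 1);
            }
        }
        return unpacked;
    }

    // Squared euclidean distance over byte vectors, one multiply-add per dimension.
    static int squaredEuclidean(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            int diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0x5A, (byte) 0xFF};
        byte[] b = {(byte) 0x0F, (byte) 0x00};

        // Same score either way, but the unpacked form has 8x the dimensions.
        System.out.println("packed dims:   " + a.length);
        System.out.println("unpacked dims: " + unpackBits(a).length);
        System.out.println("hamming:       " + hammingDistance(a, b));
        System.out.println("euclidean^2:   " + squaredEuclidean(unpackBits(a), unpackBits(b)));
    }
}
```

Both calls return the same score for the same data; the difference is that the packed form touches 2 bytes per vector here while the unpacked form touches 16, which is the storage and per-comparison overhead described above.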