pmpailis commented on code in PR #13076:
URL: https://github.com/apache/lucene/pull/13076#discussion_r1479909430
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +95,29 @@ public float compare(float[] v1, float[] v2) {
     public float compare(byte[] v1, byte[] v2) {
       return scaleMaxInnerProductScore(dotProduct(v1, v2));
     }
+  },
+  /**
+   * Binary Hamming distance; Computes how many bits are different in two bytes.
+   *
+   * <p>Only supported for bytes. To convert the distance to a similarity score we normalize using 1
+   * / (1 + hammingDistance)
+   */
+  BINARY_HAMMING_DISTANCE {
+    @Override
+    public float compare(float[] v1, float[] v2) {
+      throw new UnsupportedOperationException(
+          BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors");
+    }
+
+    @Override
+    public float compare(byte[] v1, byte[] v2) {
+      return (1f / (1 + binaryHammingDistance(v1, v2)));

Review Comment:
   I see your point. The initial idea was to keep the score bounded in `(0, 1]` so that it has a more "natural" interpretation: 1 always means the two vectors are identical, and ~0 means they are complements of each other (`1/(1+dim)`, where `dim` is the total number of bits, i.e. `dimensions * 8`).
   
   If we instead scale the score by the number of dimensions, the range becomes `(0, dimensions*8]`, which is effectively the reverse of the distance. For example, two identical vectors would have a score of `dimensions * 8`, whereas if one is the complement of the other, their score would be ~1 (`dim/(1+dim)`).
   
   I don't have a strong opinion on this and am happy to update the normalization constant if you prefer.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
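For reference, the two normalizations discussed in the review comment can be compared side by side with a small standalone sketch. This is not the code from the PR: the class `HammingScoreSketch` and the methods `boundedScore` and `scaledScore` are hypothetical names, the `binaryHammingDistance` here is a simple stand-in for the helper referenced in the diff, and the scaled variant assumes the score is `totalBits / (1 + hammingDistance)`, which is one reading of the proposal.

```java
// Illustrative sketch only; names below are assumptions, not Lucene APIs.
final class HammingScoreSketch {

  // Stand-in for the helper used in the diff: counts differing bits across two byte vectors.
  static int binaryHammingDistance(byte[] v1, byte[] v2) {
    if (v1.length != v2.length) {
      throw new IllegalArgumentException("vector lengths differ: " + v1.length + " != " + v2.length);
    }
    int distance = 0;
    for (int i = 0; i < v1.length; i++) {
      // Mask to the low 8 bits so sign extension does not inflate the count.
      distance += Integer.bitCount((v1[i] ^ v2[i]) & 0xFF);
    }
    return distance;
  }

  // Normalization from the diff: bounded in (0, 1], 1 means identical.
  static float boundedScore(byte[] v1, byte[] v2) {
    return 1f / (1 + binaryHammingDistance(v1, v2));
  }

  // Assumed form of the scaled alternative: at most dimensions * 8, about 1 for complements.
  static float scaledScore(byte[] v1, byte[] v2) {
    int totalBits = v1.length * Byte.SIZE;
    return (float) totalBits / (1 + binaryHammingDistance(v1, v2));
  }

  public static void main(String[] args) {
    byte[] a = {0x0F, 0x00};
    byte[] complement = {(byte) 0xF0, (byte) 0xFF};
    System.out.println("bounded, identical:  " + boundedScore(a, a));
    System.out.println("bounded, complement: " + boundedScore(a, complement));
    System.out.println("scaled, identical:   " + scaledScore(a, a));
    System.out.println("scaled, complement:  " + scaledScore(a, complement));
  }
}
```

Both variants rank vector pairs identically for a fixed vector length, since they differ only by the constant factor `dimensions * 8`; the choice affects only how the absolute score value is interpreted.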