uschindler commented on code in PR #13076:
URL: https://github.com/apache/lucene/pull/13076#discussion_r1479835140

##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +95,29 @@ public float compare(float[] v1, float[] v2) {
     public float compare(byte[] v1, byte[] v2) {
       return scaleMaxInnerProductScore(dotProduct(v1, v2));
     }
+  },
+  /**
+   * Binary Hamming distance; Computes how many bits are different in two bytes.
+   *
+   * <p>Only supported for bytes. To convert the distance to a similarity score we normalize using 1
+   * / (1 + hammingDistance)
+   */
+  BINARY_HAMMING_DISTANCE {
+    @Override
+    public float compare(float[] v1, float[] v2) {
+      throw new UnsupportedOperationException(
+          BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors");
+    }
+
+    @Override
+    public float compare(byte[] v1, byte[] v2) {
+      return (1f / (1 + binaryHammingDistance(v1, v2)));

Review Comment:
   This depends on vector length, is this intended? I would have expected to have something like `dimensions * 8 / (1 + distance)`. I know, it is not relevant for scoring purposes as it is a constant factor, but we have some normalization on other functions, too.

##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -214,4 +214,19 @@ public static float[] checkFinite(float[] v) {
     }
     return v;
   }
+
+  public static int binaryHammingDistance(byte[] a, byte[] b) {
+    int distance = 0, i = 0;
+    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
+      distance +=
+          Long.bitCount(
+              ((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i))
+                  & 0xFFFFFFFFFFFFFFFFL);

Review Comment:
   remove the `& 0xFFFFFFFFFFFFFFFFL`, it's useless. See my previous comment with the "final version": https://github.com/apache/lucene/pull/13076#issuecomment-1928027541

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
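
For illustration only, the two review comments above could be sketched as follows. This is not the code from the PR or from the linked "final version": `HammingSketch`, `scoreAsInPr` and `scoreScaledByBits` are hypothetical names, the local `VH_NATIVE_LONG` is an assumed stand-in for `BitUtil.VH_NATIVE_LONG`, and the length check and byte-wise tail loop are assumptions about how the truncated hunk might continue.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical standalone sketch; not part of the PR.
final class HammingSketch {
  // Assumed stand-in for BitUtil.VH_NATIVE_LONG: views 8 bytes of a byte[] as one long.
  private static final VarHandle VH_NATIVE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

  static int binaryHammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector dimensions differ: " + a.length + " != " + b.length);
    }
    int distance = 0, i = 0;
    // Main loop: XOR 8 bytes at a time. The XOR of two longs is already a long,
    // so masking it with 0xFFFFFFFFFFFFFFFFL (all 64 bits set) changes nothing.
    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
      distance +=
          Long.bitCount((long) VH_NATIVE_LONG.get(a, i) ^ (long) VH_NATIVE_LONG.get(b, i));
    }
    // Tail loop (assumed): here a mask IS needed, because byte ^ byte is promoted
    // to a sign-extended int and the upper 24 bits must be cleared before counting.
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  // Score as written in the hunk under review: 1 / (1 + distance).
  static float scoreAsInPr(byte[] v1, byte[] v2) {
    return 1f / (1 + binaryHammingDistance(v1, v2));
  }

  // Variant along the lines of the reviewer's suggestion: dimensions * 8 / (1 + distance).
  static float scoreScaledByBits(byte[] v1, byte[] v2) {
    return (v1.length * Byte.SIZE) / (1f + binaryHammingDistance(v1, v2));
  }
}
```

For a fixed vector length the two scores differ only by the constant factor `dimensions * 8`, so they rank documents identically, which matches the remark that the factor is not relevant for scoring purposes.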

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: issues-h...@lucene.apache.org