benwtrent commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1478274441
########## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ########## @@ -214,4 +214,11 @@ public static float[] checkFinite(float[] v) { } return v; } + + public static float binaryHammingDistance(byte[] a, byte[] b) { + if (a.length != b.length) { + throw new IllegalArgumentException("vector dimensions differ: " + a.length + "!=" + b.length); + } + return 1f / (1 + IMPL.binaryHammingDistance(a, b)); Review Comment: This should return `IMPL.binaryHammingDistance(a, b)`. The user's of this function will have to transform the distance to a score separately. ########## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ########## @@ -116,4 +143,14 @@ public float compare(byte[] v1, byte[] v2) { * @return the value of the similarity function applied to the two vectors */ public abstract float compare(byte[] v1, byte[] v2); + + /** + * Defines which encodings are supported by the similarity function - used in tests to control + * randomization + * + * @return a list of all supported VectorEncodings for the given similarity + */ + public Set<VectorEncoding> supportedVectorEncodings() { + return EnumSet.of(VectorEncoding.BYTE, VectorEncoding.FLOAT32); + } Review Comment: Do we do anything with the set except check membership? I think a simple `bool supportsVectorEncoding(VectorEncoding)` function would be preferred. ########## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ########## @@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) { public float compare(byte[] v1, byte[] v2) { return scaleMaxInnerProductScore(dotProduct(v1, v2)); } + }, + /** + * Binary Hamming distance; Computes how many bits are different in two bytes. + * + * <p>Only supported for bytes. To convert the distance to a similarity score we normalize using 1 + * / (1 + hammingDistance) + */ + BINARY_HAMMING_DISTANCE { + @Override + public float compare(float[] v1, float[] v2) { + throw new UnsupportedOperationException( + BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors"); + } + + @Override + public float compare(byte[] v1, byte[] v2) { + return binaryHammingDistance(v1, v2); Review Comment: The score transformation for the distance should be here. Note `squareDistance` ########## lucene/core/src/test/org/apache/lucene/index/KnnGraphTestCase.java: ########## @@ -54,35 +55,65 @@ import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.IOUtils; -import org.apache.lucene.util.VectorUtil; import org.apache.lucene.util.hnsw.HnswGraph; -import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator; import org.apache.lucene.util.hnsw.HnswGraphBuilder; import org.junit.After; import org.junit.Before; /** Tests indexing of a knn-graph */ -public class TestKnnGraph extends LuceneTestCase { Review Comment: This is a fairly large refactor adding test coverage, but I think it detracts from this PR. Could you change all this back and restrict the similarity function when using floats? There are so many abstract functions, etc. it seems like this entire test case should be rewritten from the ground up if we were to 100% test byte vs. float for it. ########## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizedVectorSimilarity.java: ########## @@ -40,6 +40,8 @@ static ScalarQuantizedVectorSimilarity fromVectorSimilarity( case EUCLIDEAN -> new Euclidean(constMultiplier); case COSINE, DOT_PRODUCT -> new DotProduct(constMultiplier); case MAXIMUM_INNER_PRODUCT -> new MaximumInnerProduct(constMultiplier); + case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException( + "Cannot use Hamming distance with scalar quantization"); }; Review Comment: It would be good if the error message had the literal `BINARY_HAMMING_DISTANCE` string to indicate the similarity enumeration we failed at. ########## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ########## @@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) { public float compare(byte[] v1, byte[] v2) { return scaleMaxInnerProductScore(dotProduct(v1, v2)); } + }, + /** + * Binary Hamming distance; Computes how many bits are different in two bytes. + * + * <p>Only supported for bytes. To convert the distance to a similarity score we normalize using 1 + * / (1 + hammingDistance) + */ + BINARY_HAMMING_DISTANCE { + @Override + public float compare(float[] v1, float[] v2) { + throw new UnsupportedOperationException( + BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors"); + } + + @Override + public float compare(byte[] v1, byte[] v2) { + return binaryHammingDistance(v1, v2); + } + + @Override + public Set<VectorEncoding> supportedVectorEncodings() { + return EnumSet.of(VectorEncoding.BYTE); + } Review Comment: ```suggestion public bool supportsVectorEncoding(VectorEncoding encoding) { return encoding == VectorEncoding.BYTE; } ``` I am not sure why this returns a set. Do we really do anything with the set except check membership? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org