benwtrent commented on code in PR #13076:
URL: https://github.com/apache/lucene/pull/13076#discussion_r1478274441
##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -214,4 +214,11 @@ public static float[] checkFinite(float[] v) {
}
return v;
}
+
+ public static float binaryHammingDistance(byte[] a, byte[] b) {
+ if (a.length != b.length) {
+ throw new IllegalArgumentException("vector dimensions differ: " +
a.length + "!=" + b.length);
+ }
+ return 1f / (1 + IMPL.binaryHammingDistance(a, b));
Review Comment:
This should return `IMPL.binaryHammingDistance(a, b)`.
The user's of this function will have to transform the distance to a score
separately.
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -116,4 +143,14 @@ public float compare(byte[] v1, byte[] v2) {
* @return the value of the similarity function applied to the two vectors
*/
public abstract float compare(byte[] v1, byte[] v2);
+
+ /**
+ * Defines which encodings are supported by the similarity function - used
in tests to control
+ * randomization
+ *
+ * @return a list of all supported VectorEncodings for the given similarity
+ */
+ public Set<VectorEncoding> supportedVectorEncodings() {
+ return EnumSet.of(VectorEncoding.BYTE, VectorEncoding.FLOAT32);
+ }
Review Comment:
Do we do anything with the set except check membership? I think a simple
`bool supportsVectorEncoding(VectorEncoding)` function would be preferred.
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) {
public float compare(byte[] v1, byte[] v2) {
return scaleMaxInnerProductScore(dotProduct(v1, v2));
}
+ },
+ /**
+ * Binary Hamming distance; Computes how many bits are different in two
bytes.
+ *
+ * <p>Only supported for bytes. To convert the distance to a similarity
score we normalize using 1
+ * / (1 + hammingDistance)
+ */
+ BINARY_HAMMING_DISTANCE {
+ @Override
+ public float compare(float[] v1, float[] v2) {
+ throw new UnsupportedOperationException(
+ BINARY_HAMMING_DISTANCE.name() + " is only supported for byte
vectors");
+ }
+
+ @Override
+ public float compare(byte[] v1, byte[] v2) {
+ return binaryHammingDistance(v1, v2);
Review Comment:
The score transformation for the distance should be here. Note
`squareDistance`
##########
lucene/core/src/test/org/apache/lucene/index/KnnGraphTestCase.java:
##########
@@ -54,35 +55,65 @@
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IOUtils;
-import org.apache.lucene.util.VectorUtil;
import org.apache.lucene.util.hnsw.HnswGraph;
-import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator;
import org.apache.lucene.util.hnsw.HnswGraphBuilder;
import org.junit.After;
import org.junit.Before;
/** Tests indexing of a knn-graph */
-public class TestKnnGraph extends LuceneTestCase {
Review Comment:
This is a fairly large refactor adding test coverage, but I think it
detracts from this PR. Could you change all this back and restrict the
similarity function when using floats?
There are so many abstract functions, etc. it seems like this entire test
case should be rewritten from the ground up if we were to 100% test byte vs.
float for it.
##########
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizedVectorSimilarity.java:
##########
@@ -40,6 +40,8 @@ static ScalarQuantizedVectorSimilarity fromVectorSimilarity(
case EUCLIDEAN -> new Euclidean(constMultiplier);
case COSINE, DOT_PRODUCT -> new DotProduct(constMultiplier);
case MAXIMUM_INNER_PRODUCT -> new MaximumInnerProduct(constMultiplier);
+ case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException(
+ "Cannot use Hamming distance with scalar quantization");
};
Review Comment:
It would be good if the error message had the literal
`BINARY_HAMMING_DISTANCE` string to indicate the similarity enumeration we
failed at.
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) {
public float compare(byte[] v1, byte[] v2) {
return scaleMaxInnerProductScore(dotProduct(v1, v2));
}
+ },
+ /**
+ * Binary Hamming distance; Computes how many bits are different in two
bytes.
+ *
+ * <p>Only supported for bytes. To convert the distance to a similarity
score we normalize using 1
+ * / (1 + hammingDistance)
+ */
+ BINARY_HAMMING_DISTANCE {
+ @Override
+ public float compare(float[] v1, float[] v2) {
+ throw new UnsupportedOperationException(
+ BINARY_HAMMING_DISTANCE.name() + " is only supported for byte
vectors");
+ }
+
+ @Override
+ public float compare(byte[] v1, byte[] v2) {
+ return binaryHammingDistance(v1, v2);
+ }
+
+ @Override
+ public Set<VectorEncoding> supportedVectorEncodings() {
+ return EnumSet.of(VectorEncoding.BYTE);
+ }
Review Comment:
```suggestion
public bool supportsVectorEncoding(VectorEncoding encoding) {
return encoding == VectorEncoding.BYTE;
}
```
I am not sure why this returns a set. Do we really do anything with the set
except check membership?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]