Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

via GitHub Mon, 05 Feb 2024 06:17:28 -0800


benwtrent commented on code in PR #13076:
URL: https://github.com/apache/lucene/pull/13076#discussion_r1478274441



##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -214,4 +214,11 @@ public static float[] checkFinite(float[] v) {
     }
     return v;
   }
+
+  public static float binaryHammingDistance(byte[] a, byte[] b) {
+    if (a.length != b.length) {
+      throw new IllegalArgumentException("vector dimensions differ: " + 
a.length + "!=" + b.length);
+    }
+    return 1f / (1 + IMPL.binaryHammingDistance(a, b));

Review Comment:
   This should return `IMPL.binaryHammingDistance(a, b)`. 
   
   The user's of this function will have to transform the distance to a score 
separately.



##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -116,4 +143,14 @@ public float compare(byte[] v1, byte[] v2) {
    * @return the value of the similarity function applied to the two vectors
    */
   public abstract float compare(byte[] v1, byte[] v2);
+
+  /**
+   * Defines which encodings are supported by the similarity function - used 
in tests to control
+   * randomization
+   *
+   * @return a list of all supported VectorEncodings for the given similarity
+   */
+  public Set<VectorEncoding> supportedVectorEncodings() {
+    return EnumSet.of(VectorEncoding.BYTE, VectorEncoding.FLOAT32);
+  }

Review Comment:
   Do we do anything with the set except check membership? I think a simple 
`bool supportsVectorEncoding(VectorEncoding)` function would be preferred.



##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) {
     public float compare(byte[] v1, byte[] v2) {
       return scaleMaxInnerProductScore(dotProduct(v1, v2));
     }
+  },
+  /**
+   * Binary Hamming distance; Computes how many bits are different in two 
bytes.
+   *
+   * <p>Only supported for bytes. To convert the distance to a similarity 
score we normalize using 1
+   * / (1 + hammingDistance)
+   */
+  BINARY_HAMMING_DISTANCE {
+    @Override
+    public float compare(float[] v1, float[] v2) {
+      throw new UnsupportedOperationException(
+          BINARY_HAMMING_DISTANCE.name() + " is only supported for byte 
vectors");
+    }
+
+    @Override
+    public float compare(byte[] v1, byte[] v2) {
+      return binaryHammingDistance(v1, v2);

Review Comment:
   The score transformation for the distance should be here. Note 
`squareDistance`



##########
lucene/core/src/test/org/apache/lucene/index/KnnGraphTestCase.java:
##########
@@ -54,35 +55,65 @@
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.IOUtils;
-import org.apache.lucene.util.VectorUtil;
 import org.apache.lucene.util.hnsw.HnswGraph;
-import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator;
 import org.apache.lucene.util.hnsw.HnswGraphBuilder;
 import org.junit.After;
 import org.junit.Before;
 
 /** Tests indexing of a knn-graph */
-public class TestKnnGraph extends LuceneTestCase {

Review Comment:
   This is a fairly large refactor adding test coverage, but I think it 
detracts from this PR. Could you change all this back and restrict the 
similarity function when using floats?
   
   There are so many abstract functions, etc. it seems like this entire test 
case should be rewritten from the ground up if we were to 100% test byte vs. 
float for it.



##########
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizedVectorSimilarity.java:
##########
@@ -40,6 +40,8 @@ static ScalarQuantizedVectorSimilarity fromVectorSimilarity(
       case EUCLIDEAN -> new Euclidean(constMultiplier);
       case COSINE, DOT_PRODUCT -> new DotProduct(constMultiplier);
       case MAXIMUM_INNER_PRODUCT -> new MaximumInnerProduct(constMultiplier);
+      case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException(
+          "Cannot use Hamming distance with scalar quantization");
     };

Review Comment:
   It would be good if the error message had the literal 
`BINARY_HAMMING_DISTANCE` string to indicate the similarity enumeration we 
failed at.



##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +98,29 @@ public float compare(float[] v1, float[] v2) {
     public float compare(byte[] v1, byte[] v2) {
       return scaleMaxInnerProductScore(dotProduct(v1, v2));
     }
+  },
+  /**
+   * Binary Hamming distance; Computes how many bits are different in two 
bytes.
+   *
+   * <p>Only supported for bytes. To convert the distance to a similarity 
score we normalize using 1
+   * / (1 + hammingDistance)
+   */
+  BINARY_HAMMING_DISTANCE {
+    @Override
+    public float compare(float[] v1, float[] v2) {
+      throw new UnsupportedOperationException(
+          BINARY_HAMMING_DISTANCE.name() + " is only supported for byte 
vectors");
+    }
+
+    @Override
+    public float compare(byte[] v1, byte[] v2) {
+      return binaryHammingDistance(v1, v2);
+    }
+
+    @Override
+    public Set<VectorEncoding> supportedVectorEncodings() {
+      return EnumSet.of(VectorEncoding.BYTE);
+    }

Review Comment:
   ```suggestion
       public bool supportsVectorEncoding(VectorEncoding encoding) {
         return encoding == VectorEncoding.BYTE;
       }
   ```
   I am not sure why this returns a set. Do we really do anything with the set 
except check membership?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

Reply via email to