pmpailis opened a new pull request, #13076: URL: https://github.com/apache/lucene/pull/13076
This PR adds support for binary Hamming distance as a similarity metric for byte vectors. The drive behind this is that there is an increasing interest in applying hashing techniques for embeddings (both in text & image based search applications e.g. [Binary Passage Retriever](https://arxiv.org/abs/2106.00882)) due to their much reduced size & performance gains that one might get. A natural way to compare binary vectors is to use Hamming distance by computing the population count of the XOR result of two embeddings, i.e. count how many different bits exist in the two binary vectors. In Lucene, we can leverage the existing `byte[]` support, and use that to store the binary vectors. The size of the byte vectors would be `d / 8` or `(d / 8) + 1` if `d % 8 > 0`, where d is the dimension of a binary vector. So for example, a binary vector of 64 bits, could use a `KnnByteVectorField` with `vectorDimension=8`. However, this transformation is currently outside of the scope of this PR. To compute the Hamming distance, since we have a bounded pool of values ranging from `Byte.MIN_VALUE` to `Byte.MAX_VALUE`, this PR makes use of a look up table to retrieve the appropriate population count. Similarly, for the Panama implementation, we rely on the approach discussed [here](https://arxiv.org/abs/1611.07612) to compute the population count of low & high bits of the vectors' XOR result using a look-up table as well. To convert the computed distance to a similarity score we finaly normalize through `1 / (1 + hamming_distance)` Benchmarks for the scalar & vectorized implementations running on my M2 Pro (Neon) dev machine: ``` Benchmark (size) Mode Cnt Score Error Units VectorUtilBenchmark.binaryHammingDistanceScalar 1 thrpt 15 514.613 ± 6.496 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 8 thrpt 15 216.716 ± 1.682 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 16 thrpt 15 135.528 ± 1.606 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 32 thrpt 15 76.745 ± 0.654 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 50 thrpt 15 52.226 ± 0.444 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 64 thrpt 15 41.246 ± 0.139 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 100 thrpt 15 29.119 ± 0.086 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 128 thrpt 15 22.639 ± 0.138 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 207 thrpt 15 14.382 ± 0.058 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 256 thrpt 15 11.813 ± 0.060 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 300 thrpt 15 10.253 ± 0.257 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 512 thrpt 15 6.145 ± 0.015 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 702 thrpt 15 4.461 ± 0.133 ops/us VectorUtilBenchmark.binaryHammingDistanceScalar 1024 thrpt 15 3.091 ± 0.003 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 1 thrpt 15 499.861 ± 0.476 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 8 thrpt 15 191.430 ± 3.243 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 16 thrpt 15 298.697 ± 4.448 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 32 thrpt 15 222.700 ± 5.461 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 50 thrpt 15 129.853 ± 0.325 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 64 thrpt 15 156.657 ± 4.337 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 100 thrpt 15 83.879 ± 1.864 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 128 thrpt 15 88.028 ± 2.156 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 207 thrpt 15 43.573 ± 1.085 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 256 thrpt 15 49.415 ± 0.865 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 300 thrpt 15 34.535 ± 0.408 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 512 thrpt 15 25.953 ± 0.197 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 702 thrpt 15 17.483 ± 0.033 ops/us VectorUtilBenchmark.binaryHammingDistanceVector 1024 thrpt 15 13.305 ± 0.085 ops/us ``` `BINARY_HAMMING_DISTANCE` is currently only supported for `byte[]` vectors so I had to slightly refactor some tests to distinguish between float and byte versions of knn-search. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org