pmpailis opened a new pull request, #13076:
URL: https://github.com/apache/lucene/pull/13076

   
   This PR adds support for binary Hamming distance as a similarity metric for 
byte vectors. The drive behind this is that there is an increasing interest in 
applying hashing techniques for embeddings (both in text & image based search 
applications e.g. [Binary Passage Retriever](https://arxiv.org/abs/2106.00882)) 
due to their much reduced size & performance gains that one might get. A 
natural way to compare binary vectors is to use Hamming distance by computing 
the population count of the XOR result of two embeddings, i.e. count how many 
different bits exist in the two binary vectors. 
   
   In Lucene, we can leverage the existing `byte[]` support, and use that to 
store the binary vectors. The size of the byte vectors would be `d / 8` or `(d 
/ 8) + 1` if `d % 8 > 0`, where d is the dimension of a binary vector. So for 
example, a binary vector of 64 bits, could use a `KnnByteVectorField` with 
`vectorDimension=8`. However, this transformation is currently outside of the 
scope of this PR. 
   
   To compute the Hamming distance, since we have a bounded pool of values 
ranging from `Byte.MIN_VALUE` to `Byte.MAX_VALUE`, this PR makes use of a look 
up table to retrieve the appropriate population count. Similarly, for the 
Panama implementation, we rely on the approach discussed 
[here](https://arxiv.org/abs/1611.07612) to compute the population count of low 
& high bits of the vectors' XOR result using a look-up table as well. 
   
   To convert the computed distance to a similarity score we finaly normalize 
through `1 / (1 + hamming_distance)`
   
   
   Benchmarks for the scalar & vectorized implementations running on my M2 Pro 
(Neon) dev machine: 
   ```
   Benchmark                                        (size)   Mode  Cnt    Score 
  Error   Units
   VectorUtilBenchmark.binaryHammingDistanceScalar       1  thrpt   15  514.613 
± 6.496  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar       8  thrpt   15  216.716 
± 1.682  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar      16  thrpt   15  135.528 
± 1.606  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar      32  thrpt   15   76.745 
± 0.654  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar      50  thrpt   15   52.226 
± 0.444  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar      64  thrpt   15   41.246 
± 0.139  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     100  thrpt   15   29.119 
± 0.086  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     128  thrpt   15   22.639 
± 0.138  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     207  thrpt   15   14.382 
± 0.058  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     256  thrpt   15   11.813 
± 0.060  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     300  thrpt   15   10.253 
± 0.257  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     512  thrpt   15    6.145 
± 0.015  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar     702  thrpt   15    4.461 
± 0.133  ops/us
   VectorUtilBenchmark.binaryHammingDistanceScalar    1024  thrpt   15    3.091 
± 0.003  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector       1  thrpt   15  499.861 
± 0.476  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector       8  thrpt   15  191.430 
± 3.243  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector      16  thrpt   15  298.697 
± 4.448  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector      32  thrpt   15  222.700 
± 5.461  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector      50  thrpt   15  129.853 
± 0.325  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector      64  thrpt   15  156.657 
± 4.337  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     100  thrpt   15   83.879 
± 1.864  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     128  thrpt   15   88.028 
± 2.156  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     207  thrpt   15   43.573 
± 1.085  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     256  thrpt   15   49.415 
± 0.865  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     300  thrpt   15   34.535 
± 0.408  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     512  thrpt   15   25.953 
± 0.197  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector     702  thrpt   15   17.483 
± 0.033  ops/us
   VectorUtilBenchmark.binaryHammingDistanceVector    1024  thrpt   15   13.305 
± 0.085  ops/us
   ```
   
   `BINARY_HAMMING_DISTANCE` is currently only supported for `byte[]` vectors 
so I had to slightly refactor some tests to distinguish between float and byte 
versions of knn-search. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to