lpld opened a new issue, #14342: URL: https://github.com/apache/lucene/issues/14342
Hi Lucene team. Last week I've been playing with the [quantization format](https://github.com/apache/lucene/pull/14078) that was recently added to Lucene. The main idea was to take the datasets from [ann-benchmarks](https://github.com/erikbern/ann-benchmarks) and run kNN benchmarks on them with the new Lucene quantization; I measured only recall at this point. Most of those datasets are low-dimensional, and the results were not as good as I expected. In fact, in most cases even naive binary quantization using `ScalarQuantizer` performed better. However, on high-dimensional datasets such as `Gist1m` and `Coco-i2i` the results were really good. I have already started this discussion in the [pull request](https://github.com/apache/lucene/pull/14078#issuecomment-2713454213) itself, and as I understand it, this new quantization is expected to perform well only with high-dimensional vectors, and supposedly only with text embeddings (for instance, 784-dimensional MNIST had only 6% recall). Anyway, I thought I'd raise the question in a separate issue. Please feel free to close it if it is irrelevant.

Here are the results I've got. All tests were run with these parameters: `topK = 100`, `maxConn = 64`, `beamWidth = 250`, `fanout = 100`, `overSample = 5`.

| Dataset | Vectors x dims | Recall (non-quantized) | Recall (quantized) |
| --- | --- | --- | --- |
| Glove25 | 1_183_514 x 25 | 0.999 | 0.342 |
| Glove100 | 1_183_514 x 100 | 0.923 | 0.504 |
| Glove200 | 1_183_514 x 200 | 0.874 | 0.525 |
| Mnist784 | 60_000 x 784 | 1.000 | 0.062 |
| FashionMnist784 | 60_000 x 784 | 1.000 | 0.018 |
| LastFm64 | 292_385 x 65 | 0.999 | 0.381 |
| Coco-i2i | 113_287 x 512 | 1.000 | 0.972 |
| Coco-t2i | 113_287 x 512 | 0.992 | 0.567 |
| SiftSmall | 10_000 x 128 | 1.0 | 0.31 |
| Sift | 1_000_000 x 128 | 0.999 | 0.235 |
| Gist | 1_000_000 x 960 | 0.994 | 0.987 |
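For context, a minimal sketch of the kind of indexing and search setup behind these numbers might look like the following. The codec and format class names (`Lucene101Codec`, `Lucene102HnswBinaryQuantizedVectorsFormat`, its `(maxConn, beamWidth)` constructor) and the `loadDataset()`/`loadQueries()` helpers are assumptions/placeholders, not the exact code I ran; the real format class is the one introduced in the PR linked above. Here `fanout` is interpreted as asking the graph for `topK + fanout` candidates and keeping the top `topK`; the `overSample` step (gathering `overSample * topK` quantized candidates and re-ranking them against the full-precision vectors) is not shown.

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.codecs.lucene102.Lucene102HnswBinaryQuantizedVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;

public class QuantizedRecallBench {
  public static void main(String[] args) throws Exception {
    int maxConn = 64, beamWidth = 250, topK = 100, fanout = 100;

    try (FSDirectory dir = FSDirectory.open(Path.of("bench-index"))) {
      // Route the vector field to the quantized vectors format (assumed class name).
      IndexWriterConfig iwc = new IndexWriterConfig()
          .setCodec(new Lucene101Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
              return new Lucene102HnswBinaryQuantizedVectorsFormat(maxConn, beamWidth);
            }
          });

      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        for (float[] vector : loadDataset()) { // placeholder dataset loader
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
      }

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        for (float[] query : loadQueries()) { // placeholder query loader
          // Collect topK + fanout graph candidates, keep the best topK for recall.
          TopDocs hits = searcher.search(new KnnFloatVectorQuery("vec", query, topK + fanout), topK);
          // ... compare hits against the ground-truth neighbors to compute recall
        }
      }
    }
  }

  static Iterable<float[]> loadDataset() { throw new UnsupportedOperationException(); }
  static Iterable<float[]> loadQueries() { throw new UnsupportedOperationException(); }
}
```

The similarity function above is a placeholder too; in the actual runs it follows whatever each ann-benchmarks dataset specifies (Euclidean for SIFT/GIST/MNIST, angular for GloVe, etc.).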