lpld opened a new issue, #14342: URL: https://github.com/apache/lucene/issues/14342
Hi Lucene team. Last week I've been playing with the [quantization format](https://github.com/apache/lucene/pull/14078) that was recently added to Lucene. The main idea was to take the datasets from [ann-benchmarks](https://github.com/erikbern/ann-benchmarks) and run kNN benchmarks on them with the new Lucene quantization; I measured only recall at this point. Most of those datasets are low-dimensional, and the results were not as good as I expected. In fact, in most cases even naive binary quantization using `ScalarQuantizer` performed better. However, on high-dimensional datasets such as `Gist1m` and `Coco-i2i` the results were really good. I have already started this discussion in the [pull request](https://github.com/apache/lucene/pull/14078#issuecomment-2713454213) itself, and as I understand it, this new quantization is expected to perform well only with high-dimensional vectors, and supposedly only with text embeddings (for instance, 784-dimensional MNIST had only 6% recall). Anyway, I thought I'd raise the question in a separate issue. Please feel free to close it if it is irrelevant.

Here are the results I've got. All tests were run with these parameters: `topK = 100`, `maxConn = 64`, `beamWidth = 250`, `fanout = 100`, `overSample = 5`.

| Dataset | Vectors x dims | Recall (non-quantized) | Recall (quantized) |
| --- | --- | --- | --- |
| Glove25 | 1_183_514 x 25 | 0.999 | 0.342 |
| Glove100 | 1_183_514 x 100 | 0.923 | 0.504 |
| Glove200 | 1_183_514 x 200 | 0.874 | 0.525 |
| Mnist784 | 60_000 x 784 | 1.000 | 0.062 |
| FashionMnist784 | 60_000 x 784 | 1.000 | 0.018 |
| LastFm64 | 292_385 x 65 | 0.999 | 0.381 |
| Coco-i2i | 113_287 x 512 | 1.000 | 0.972 |
| Coco-t2i | 113_287 x 512 | 0.992 | 0.567 |
| SiftSmall | 10_000 x 128 | 1.0 | 0.31 |
| Sift | 1_000_000 x 128 | 0.999 | 0.235 |
| Gist | 1_000_000 x 960 | 0.994 | 0.987 |
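For context, a minimal sketch of the kind of indexing and search setup behind these numbers might look like the following. The codec and format class names (`Lucene101Codec`, `Lucene102HnswBinaryQuantizedVectorsFormat`, its `(maxConn, beamWidth)` constructor) and the `loadDataset()`/`loadQueries()` helpers are assumptions/placeholders, not the exact code I ran; the real format class is the one introduced in the PR linked above. Here `fanout` is interpreted as asking the graph for `topK + fanout` candidates and keeping the top `topK`; the `overSample` step (gathering `overSample * topK` quantized candidates and re-ranking them against the full-precision vectors) is not shown.

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.codecs.lucene102.Lucene102HnswBinaryQuantizedVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;

public class QuantizedRecallBench {
  public static void main(String[] args) throws Exception {
    int maxConn = 64, beamWidth = 250, topK = 100, fanout = 100;

    try (FSDirectory dir = FSDirectory.open(Path.of("bench-index"))) {
      // Route the vector field to the quantized vectors format (assumed class name).
      IndexWriterConfig iwc = new IndexWriterConfig()
          .setCodec(new Lucene101Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
              return new Lucene102HnswBinaryQuantizedVectorsFormat(maxConn, beamWidth);
            }
          });

      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        for (float[] vector : loadDataset()) { // placeholder dataset loader
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
      }

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        for (float[] query : loadQueries()) { // placeholder query loader
          // Collect topK + fanout graph candidates, keep the best topK for recall.
          TopDocs hits = searcher.search(new KnnFloatVectorQuery("vec", query, topK + fanout), topK);
          // ... compare hits against the ground-truth neighbors to compute recall
        }
      }
    }
  }

  static Iterable<float[]> loadDataset() { throw new UnsupportedOperationException(); }
  static Iterable<float[]> loadQueries() { throw new UnsupportedOperationException(); }
}
```

The similarity function above is a placeholder too; in the actual runs it follows whatever each ann-benchmarks dataset specifies (Euclidean for SIFT/GIST/MNIST, angular for GloVe, etc.).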