benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2733531171
First, thank you @lpld for digging in and running these benchmarks!

OK, I think I see the weirdness with the `mnist` data set. It's not about it being a transformer model; it has to do with the distribution of the vector components. I think we can significantly improve performance here for non-normally distributed vector components.

Let me illustrate. Here is the centroid-centered distribution of e5-small over the quora dataset:

[image]

Here is the centroid-centered distribution of fashion-mnist:

[image]

Not normal at all.

GIST-1M is an example of a dataset that isn't "optimal", but still works:

[image]

The initialization parameters for optimized scalar quantization make an assumption about the distribution of vector components. However, I think we can improve this by:

Option 0: There might just be a bug... I will spend some time seeing if I can find one.

Option 1:
- Test the distribution of the components to verify normality. This can be done safely over a sample of the vector set without too much compute power.
- Adjust the initialization parameters for the anisotropic loss optimizations.

Option 2: There might be something simpler: just allow folks to provide a static confidence as the initialization parameter. This would bypass our initialization parameters and do anisotropic loss from the calculated intervals.

Option 3 (really not an option with HNSW, I think): Another option is to utilize multiple centroids. However, using multiple centroids without HNSW actually knowing about them is incredibly inefficient and will cause compute issues.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
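As a rough illustration of the sampled normality test mentioned in Option 1 of the comment, the sketch below centers a random sample of vectors on their centroid, pools the components, and checks sample skewness and excess kurtosis (both near 0 for a normal distribution). This is not Lucene code: the function names, the `sample_size` default, and the `tolerance` threshold are all assumptions made for the example, and pooling components across dimensions assumes the dimensions have roughly similar variance.

```python
import random


def centered_components(vectors):
    """Subtract the per-dimension centroid and pool all components into one list."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return [v[d] - centroid[d] for v in vectors for d in range(dims)]


def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0.0:
        raise ValueError("all components are identical")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def looks_normal(vectors, sample_size=1000, tolerance=0.5, seed=0):
    """Heuristic check on a random sample of vectors.

    `tolerance` is an arbitrary illustrative threshold, not a tuned value.
    """
    rng = random.Random(seed)
    if len(vectors) > sample_size:
        vectors = rng.sample(vectors, sample_size)
    skew, kurt = skew_and_excess_kurtosis(centered_components(vectors))
    return abs(skew) <= tolerance and abs(kurt) <= tolerance
```

A caller could run `looks_normal(vectors)` on a sample before quantization and fall back to different initialization parameters when it returns `False`. A formal test (for example a D'Agostino-Pearson test) may be a better fit in practice; the moment check here is only the simplest thing that shows the idea.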
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org