Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]

via GitHub Tue, 18 Mar 2025 07:49:04 -0700


benwtrent commented on issue #14342:
URL: https://github.com/apache/lucene/issues/14342#issuecomment-2733531171


   First, thank you @lpld for digging in and running these benchmarks!
   
   OK, I think I see the weirdness with the `mnist` data set. It's not about it 
being a transformer model; it has to do with the distribution of the 
components. 
   
   I think we can significantly improve performance here for non-normally 
distributed vector components. 
   
   Let me illustrate. 
   
   Here is the centroid-centered distribution of e5-small over the Quora 
dataset:
   
   
![Image](https://github.com/user-attachments/assets/4187d470-f109-4c3b-b447-b28dbdf4541f)
   
   Here is the centroid-centered distribution of fashion-mnist:
   
   
![Image](https://github.com/user-attachments/assets/695c232d-b4b8-4e93-b8c3-dbf20f33f038)
   
   Not normal at all. 
   
   GIST-1M is an example of a dataset that isn't "optimal", but still works:
   
   
![Image](https://github.com/user-attachments/assets/b7253633-9af1-4650-8c4a-9e416c28e7e7)
   
   
   
   The initialization parameters for optimized scalar quantization make an 
assumption about the distribution of vector components. However, I think we 
can improve this by:
   
   Option 0:
   
   There might just be a bug...I will spend some time seeing if I can find 
one...
   
   Option 1:
    - Test the distribution of the components to verify normality. This can 
be done safely over a sample of the vector set without too much compute 
power.
    - Adjust the initialization parameters for the anisotropic loss 
optimizations.
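   
   A minimal sketch of the normality test in Option 1, assuming a simple 
moment-based check: center a sample of vectors on its centroid, pool the 
components, and compare skewness and excess kurtosis against a tolerance. 
The class `ComponentNormality`, its methods, and the tolerance are all 
hypothetical and are not part of Lucene. Pooling components across 
dimensions is also an assumption made for illustration.

```java
/** Hypothetical helper (not a Lucene API): moment-based normality check on a sample. */
final class ComponentNormality {

  /** Returns {skewness, excessKurtosis} of the pooled centroid-centered components. */
  static double[] pooledMoments(float[][] sample) {
    int dim = sample[0].length;
    // Centroid = per-dimension mean of the sample, accumulated in double precision.
    double[] centroid = new double[dim];
    for (float[] v : sample) {
      for (int i = 0; i < dim; i++) {
        centroid[i] += v[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      centroid[i] /= sample.length;
    }
    // After centering, the pooled mean is zero, so raw moments equal central moments.
    double m2 = 0, m3 = 0, m4 = 0;
    for (float[] v : sample) {
      for (int i = 0; i < dim; i++) {
        double d = v[i] - centroid[i];
        double d2 = d * d;
        m2 += d2;
        m3 += d2 * d;
        m4 += d2 * d2;
      }
    }
    double n = (double) sample.length * dim;
    m2 /= n;
    m3 /= n;
    m4 /= n;
    double skewness = m3 / Math.pow(m2, 1.5);
    double excessKurtosis = m4 / (m2 * m2) - 3.0; // 0 for a normal distribution
    return new double[] {skewness, excessKurtosis};
  }

  /** True when both moments are within the (assumed) tolerance of the normal values. */
  static boolean looksNormal(float[][] sample, double tolerance) {
    double[] m = pooledMoments(sample);
    return Math.abs(m[0]) <= tolerance && Math.abs(m[1]) <= tolerance;
  }
}
```

   A fixed tolerance on the moments is used here instead of a formal test 
statistic because, with many pooled components, a significance test would 
likely flag even tiny deviations. A result of `false` would be the signal to 
adjust the initialization parameters.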
   
   Option 2:
   There might be something simpler: just allow folks to provide a static 
confidence as the initialization parameter. This would bypass our 
initialization parameters and run the anisotropic loss optimization from the 
calculated intervals. 
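   
   A rough sketch of how a static confidence could produce the starting 
interval in Option 2. It assumes the confidence is mapped to symmetric 
quantiles of the centroid-centered components; that mapping, the class 
`StaticConfidenceInterval`, and the method `initialInterval` are 
hypothetical and are not existing Lucene APIs.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Lucene API): initial interval from a static confidence. */
final class StaticConfidenceInterval {

  /**
   * Returns {lower, upper} such that roughly `confidence` of the components fall inside.
   * A confidence of 1.0 yields the min and max of the components.
   */
  static float[] initialInterval(float[] centeredComponents, float confidence) {
    if (!(confidence > 0f && confidence <= 1f)) {
      throw new IllegalArgumentException("confidence must be in (0, 1]");
    }
    float[] sorted = centeredComponents.clone();
    Arrays.sort(sorted);
    // Split the excluded mass evenly between the two tails.
    double tail = (1.0 - confidence) / 2.0;
    int lo = (int) Math.floor(tail * (sorted.length - 1));
    int hi = (int) Math.ceil((1.0 - tail) * (sorted.length - 1));
    return new float[] {sorted[lo], sorted[hi]};
  }
}
```

   The returned pair would only be the starting point; the anisotropic loss 
optimization would still refine it from there.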
   
   Option 3 (really not an option with HNSW, I think):
   
   Another option is to utilize multiple centroids. However, using multiple 
centroids without HNSW actually knowing about them is incredibly inefficient 
and will cause compute issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
