Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

via GitHub Thu, 30 Jan 2025 07:10:19 -0800


benwtrent commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2624759420


   I like where this PR is going.
   
   > Note: This change does not include dependent multi-valued vectors like 
ColBERT, where the multiple vectors must be used together to compute similarity. 
It does, however, lay essential groundwork which can subsequently be extended 
for this support.
   
   I think this PR is still doing globally unique ordinals for vectors? So, 
ordinals `1, 2, 3` go to document `1` and ordinals `4, 5` go to document `2`? 
If so, I think we should "bite the bullet" and make vector ordinals `long` 
values. I know this makes HNSW graph building 2x as expensive when it comes to 
memory usage. But it seems like something we should do.
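
A minimal sketch of the globally unique ordinal layout described above, assuming each document owns one contiguous ordinal range and has at least one vector. The class and method names (`GlobalOrdinals`, `docStarts`, `docForOrdinal`) are hypothetical and are not taken from the PR or from Lucene. The sketch uses 0-based ordinals and document ids, so the example in the comment becomes ordinals `0, 1, 2` for doc `0` and `3, 4` for doc `1`.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the PR's or Lucene's implementation.
final class GlobalOrdinals {

    // starts[d] is the first ordinal of document d; starts[n] is the total vector count.
    static int[] docStarts(int[] vectorsPerDoc) {
        int[] starts = new int[vectorsPerDoc.length + 1];
        for (int d = 0; d < vectorsPerDoc.length; d++) {
            // Math.addExact throws if the running total overflows the int ordinal space.
            starts[d + 1] = Math.addExact(starts[d], vectorsPerDoc[d]);
        }
        return starts;
    }

    // Maps a global vector ordinal back to its owning document via binary search.
    // Assumes every document has at least one vector, so starts is strictly increasing.
    static int docForOrdinal(int[] starts, int ordinal) {
        if (ordinal < 0 || ordinal >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("ordinal out of range: " + ordinal);
        }
        int idx = Arrays.binarySearch(starts, ordinal);
        // Exact hit: ordinal is the first vector of doc idx.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owner is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        int[] starts = docStarts(new int[] {3, 2});
        for (int ord = 0; ord < starts[starts.length - 1]; ord++) {
            System.out.println("ordinal " + ord + " -> doc " + docForOrdinal(starts, ord));
        }
    }
}
```

Because every vector consumes one ordinal, the total vector count per segment, rather than the document count, is what has to fit in the ordinal type.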
   
   Models like ColPALI (and ColBERT) will index 100s or as many as 1k vectors 
per document. With `int` ordinals, this will cause the number of documents per 
Lucene segment to be restricted to roughly 2-20M, much lower than it is now.
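
The 2-20M range can be checked with a quick back-of-the-envelope calculation. This is a sketch that assumes the ordinal space is capped at `Integer.MAX_VALUE` and that every document carries the same number of vectors. `OrdinalBudget` and `maxDocs` are hypothetical names used only for illustration.

```java
// Hypothetical illustration only; the uniform vectors-per-document count is an assumption.
final class OrdinalBudget {

    // Upper bound on documents per segment when each document uses vectorsPerDoc ordinals.
    static long maxDocs(long maxOrdinals, int vectorsPerDoc) {
        if (vectorsPerDoc <= 0) {
            throw new IllegalArgumentException("vectorsPerDoc must be positive");
        }
        return maxOrdinals / vectorsPerDoc;
    }

    public static void main(String[] args) {
        for (int perDoc : new int[] {100, 1000}) {
            System.out.printf("%d vectors/doc: at most %d docs with int ordinals%n",
                    perDoc, maxDocs(Integer.MAX_VALUE, perDoc));
        }
        // A long ordinal takes twice the bytes of an int ordinal, which is where
        // the roughly 2x memory cost for stored ordinals comes from.
        System.out.println("bytes per ordinal: int=" + Integer.BYTES + ", long=" + Long.BYTES);
    }
}
```

With 100 vectors per document the bound is about 21M documents, and with 1,000 it is about 2M, which matches the range quoted above.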


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


