benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2624759420
I like where this PR is going.

> Note: This change does not include dependent multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similarity. It does however lay essential groundwork which can subsequently be extended for this support.

I think this PR is still doing globally unique ordinals for vectors? So, ordinals `1, 2, 3` go to doc `1` and ordinals `4, 5` go to doc `2`?

If so, I think we should "bite the bullet" and make vector ordinals `long` values. I know this makes HNSW graph building 2x as expensive when it comes to memory usage, but it seems like something we should do.

Models like ColPali (and ColBERT) will index 100s or as many as 1k vectors per document. With `int` ordinals, that restricts a Lucene segment to roughly 2-20M documents, much lower than it is now.
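For reference, a rough sketch of where the 2-20M figure comes from. It assumes every vector in a segment gets a globally unique, non-negative `int` ordinal, so a segment holds at most about `Integer.MAX_VALUE` vectors. `OrdinalBudget` and its methods are hypothetical names for illustration, not Lucene APIs, and the neighbor-storage estimate only counts the raw ordinal arrays.

```java
// Hypothetical helper, not part of Lucene: back-of-the-envelope ordinal budget.
public final class OrdinalBudget {
    private OrdinalBudget() {}

    /**
     * Approximate number of documents that fit in one segment when each
     * document contributes vectorsPerDoc globally unique int ordinals.
     * Assumption: the ordinal space is capped at Integer.MAX_VALUE.
     */
    public static long maxDocsWithIntOrdinals(int vectorsPerDoc) {
        if (vectorsPerDoc <= 0) {
            throw new IllegalArgumentException("vectorsPerDoc must be positive");
        }
        return Integer.MAX_VALUE / vectorsPerDoc;
    }

    /**
     * Bytes needed to store neighborCount neighbor ordinals as raw
     * int or long values (ignores any compression or object overhead).
     */
    public static long neighborBytes(long neighborCount, boolean longOrdinals) {
        return neighborCount * (longOrdinals ? Long.BYTES : Integer.BYTES);
    }

    public static void main(String[] args) {
        System.out.println(maxDocsWithIntOrdinals(100));  // prints 21474836
        System.out.println(maxDocsWithIntOrdinals(1000)); // prints 2147483
        // Raw neighbor storage doubles when ordinals widen from int to long.
        System.out.println(neighborBytes(32, false));     // prints 128
        System.out.println(neighborBytes(32, true));      // prints 256
    }
}
```

So at 100 vectors per doc the cap is about 21M docs, and at 1k vectors per doc it drops to about 2M. Widening ordinals to `long` removes that cap at the cost of doubling the raw neighbor arrays.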