jimczi commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2441247130
> it seems like single vector is a special form of multi-vector

The solution really depends on the semantics. In its current form, the way multi-vectors are incorporated in this PR doesn’t quite extend the single-vector case. With max similarity, we assume that each similarity score results from a full comparison, which works well when the number of comparisons is limited (such as in re-ranking scenarios). However, for ColBERT, where the average number of vectors per document is large (in the hundreds or thousands), using HNSW with max similarity layered on top may not be the optimal approach. This is likely why other vector libraries don’t expose this setup.

If our aim is to introduce max similarity in Lucene, we might need to explore a more effective strategy. Although nested vectors could be promising, they’re currently constrained by the 2B vector limit, which isn’t ideal for ColBERT, given that each input token is represented as a dense vector. The primary limitation with HNSW and the knn codec today seems to be this 2B cap on vectors.

Given these factors, we may want to reconsider HNSW for this purpose. A scalable solution would likely involve running multiple queries (one per query vector) rather than relying on an aggregation strategy. Maybe the first goal should be to incorporate max sim for re-ranking use cases, using a flat format?
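
To make the re-ranking idea concrete, here is a minimal sketch of max sim over a small candidate set. It assumes the ColBERT-style definition (for each query vector take the best dot product against any document vector, then sum over query vectors), which may not be exactly the aggregation this PR implements. The in-memory dict stands in for a flat format, and `dot`, `max_sim`, and `rerank` are hypothetical names for illustration, not Lucene APIs.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_vectors: List[Vector], doc_vectors: List[Vector]) -> float:
    """Assumed ColBERT-style score: sum over query vectors of the best
    dot product against any document vector. Requires a non-empty document."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def rerank(
    query_vectors: List[Vector],
    candidates: Dict[str, List[Vector]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate exhaustively (a flat scan, no graph index)
    and return the top_k documents by max sim, best first."""
    scored = [(doc_id, max_sim(query_vectors, vecs)) for doc_id, vecs in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: two query vectors, three candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5]],
    "doc_c": [[1.0, 0.0], [0.0, 0.5]],
}
ranked = rerank(query, candidates, top_k=2)
print(ranked)
```

Each score costs one comparison per (query vector, document vector) pair, which is affordable for a short candidate list but is the "full comparison" cost that becomes a problem when it is layered on top of HNSW traversal.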