vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2455892813

One use case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use cases would have a limited set of embeddings per document, but leverage similarity computations across all of them.

I see your point about scaling challenges with very high cardinality multi-vectors like token-level ColBERT embeddings. Keeping them in a `BinaryDocValues` field is a good idea for scoring-only applications. I like the `LateInteractionField` wrapper you shared; we should bring it into Lucene for such use cases.

However, I do think there is space for both solutions. It's not obvious to me how the knn codec gets polluted with future complexity; we would still support single vectors as-is. My mental model is: if you want to use multi-vectors in nearest-neighbor search (HNSW, or newer algorithms later), index them in the knn field. Otherwise, index them separately as doc values used only for re-ranking top results (a rough sketch of that path follows below).
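To make the scoring-only path concrete, here is a minimal sketch (my own illustration, not the `LateInteractionField` wrapper referenced above) of packing a document's per-token or per-aspect vectors into a `BinaryDocValues` field at index time and decoding them for a ColBERT-style MaxSim re-rank. The class name `MultiVectorDocValues`, the fixed dimension, and the little-endian float packing are assumptions made for the example.

```java
// Sketch only: stores a document's multi-vectors in BinaryDocValues for re-ranking.
// Assumes all vectors share a fixed dimension `dim`; not an actual Lucene field type.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

public final class MultiVectorDocValues {

  /** Pack a document's token/aspect vectors (all of dimension {@code dim}) into one BytesRef. */
  public static BytesRef encode(List<float[]> vectors, int dim) {
    ByteBuffer buf =
        ByteBuffer.allocate(vectors.size() * dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float[] v : vectors) {
      for (int i = 0; i < dim; i++) {
        buf.putFloat(v[i]);
      }
    }
    return new BytesRef(buf.array());
  }

  /** Index time: store the packed vectors alongside whatever fields drive first-pass retrieval. */
  public static Document toDocument(String field, List<float[]> vectors, int dim) {
    Document doc = new Document();
    doc.add(new BinaryDocValuesField(field, encode(vectors, dim)));
    return doc;
  }

  /** Re-rank time: MaxSim (sum over query vectors of the best dot product) against a doc's vectors. */
  public static float maxSimScore(
      LeafReader reader, String field, int docId, float[][] queryVectors, int dim)
      throws IOException {
    BinaryDocValues values = DocValues.getBinary(reader, field);
    if (values.advanceExact(docId) == false) {
      return 0f;
    }
    BytesRef packed = values.binaryValue();
    ByteBuffer buf =
        ByteBuffer.wrap(packed.bytes, packed.offset, packed.length).order(ByteOrder.LITTLE_ENDIAN);
    int numDocVectors = packed.length / (dim * Float.BYTES);
    float[][] docVectors = new float[numDocVectors][dim];
    for (int v = 0; v < numDocVectors; v++) {
      for (int i = 0; i < dim; i++) {
        docVectors[v][i] = buf.getFloat();
      }
    }
    float score = 0f;
    for (float[] q : queryVectors) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : docVectors) {
        float dot = 0f;
        for (int i = 0; i < dim; i++) {
          dot += q[i] * d[i];
        }
        best = Math.max(best, dot);
      }
      score += best;
    }
    return score;
  }
}
```

Because the packed vectors never enter the HNSW graph, a field like this leaves the knn codec untouched and matches the mental model above: knn fields for approximate nearest-neighbor retrieval, doc values strictly for re-ranking a small candidate set.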
One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use-cases would have a limited set of embeddings, but leverage similarity computations across all of them. I see your point about scaling challenges with very high cardinality multi-vectors like token level ColBERT embeddings. Keeping them in a `BinaryDocValues` field is a good idea for scoring only applications. I like the `LateInteractionField` wrapper you shared, we should bring it into Lucene for such usecases. However, I do think there is space for both solutions. It's not obvious to me how the knn codec gets polluted for future complexity. We would still support single vectors as is. My mental model is: if you want to use multi-vectors in nearest neighbor search (hnsw or newer algos later), index them in the knn field. Otherwise, index them separately as doc-values used only for re-ranking top results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org