vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2455892813

   One use-case for multi-vectors is indexing product aspects as separate 
embeddings for e-commerce search. At Amazon Product Search (where I work), we'd 
like to experiment with separate embeddings to represent product attributes, 
user product opinions, and product images. Such e-commerce use-cases would have 
a limited set of embeddings, but leverage similarity computations across all of 
them.
   
   I see your point about scaling challenges with very high cardinality 
multi-vectors like token-level ColBERT embeddings. Keeping them in a 
`BinaryDocValues` field is a good idea for scoring-only applications. I like 
the `LateInteractionField` wrapper you shared; we should bring it into Lucene 
for such use-cases.
   
   However, I do think there is space for both solutions. It's not obvious to 
me how the knn codec gets polluted for future complexity. We would still 
support single vectors as is. My mental model is: if you want to use 
multi-vectors in nearest neighbor search (HNSW now, or newer algorithms later), 
index them in the knn field. Otherwise, index them separately as doc-values 
used only for re-ranking top results.
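
   To make the doc-values re-ranking path concrete, here is a minimal sketch 
of what it could look like, assuming ColBERT-style sum-of-max-similarity scoring 
with dot products. It deliberately uses no Lucene APIs and is not the 
`LateInteractionField` shared earlier in this thread: the class name, the 
method names, and the byte layout (a dimension header followed by 
little-endian floats) are all hypothetical, chosen only to show how a 
multi-vector might be packed into one binary value and scored after retrieval.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: not part of Lucene and not the wrapper from this thread.
public class LateInteractionSketch {

  // Pack a multi-vector into one byte[] (assumed layout: [int dim][floats...]).
  // A value like this could be stored in a binary doc-values field.
  public static byte[] encode(float[][] vectors) {
    if (vectors.length == 0) {
      throw new IllegalArgumentException("multi-vector must not be empty");
    }
    int dim = vectors[0].length;
    ByteBuffer buf = ByteBuffer
        .allocate(Integer.BYTES + vectors.length * dim * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(dim);
    for (float[] v : vectors) {
      if (v.length != dim) {
        throw new IllegalArgumentException("all vectors must share one dimension");
      }
      for (float x : v) {
        buf.putFloat(x);
      }
    }
    return buf.array();
  }

  // Inverse of encode(): recover the individual vectors from the packed bytes.
  public static float[][] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    int dim = buf.getInt();
    int count = buf.remaining() / (dim * Float.BYTES);
    float[][] out = new float[count][dim];
    for (float[] v : out) {
      for (int i = 0; i < dim; i++) {
        v[i] = buf.getFloat();
      }
    }
    return out;
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // For each query vector take its best-matching document vector,
  // then sum those maxima over all query vectors.
  public static float sumMaxSim(float[][] query, float[][] doc) {
    float total = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) {
        best = Math.max(best, dot(q, d));
      }
      total += best;
    }
    return total;
  }

  public static void main(String[] args) {
    float[][] query = {{1f, 0f}, {0f, 1f}};
    float[][] doc = {{1f, 0f}, {0.5f, 0.5f}};
    byte[] stored = encode(doc); // what would be written at index time
    System.out.println(sumMaxSim(query, decode(stored))); // re-rank time
  }
}
```

   In this sketch the scoring cost grows with (query vectors x document 
vectors) per candidate, which is why it suits re-ranking a small set of top 
results rather than driving the nearest neighbor search itself.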


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

