jimczi commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2445295372

   The more I think about it, the less convinced I am that the knn codec is the best choice for this feature (assuming this issue is focused on late interaction models).
   
   > It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the PerFieldKnnVectorsFormat to wire different data structures on top of the flat multi-vector format. We can also provide something out of the box in a subsequent change. I think the aggregation function interface is also flexible for different types of similarity implementations?
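
   As a side note on the quoted point: `PerFieldKnnVectorsFormat` does let users route a single field to a different format. A minimal sketch of that wiring, assuming a hypothetical `MultiVectorFormat` (the codec version and field name here are illustrative, not part of this PR):
   ```java
   import org.apache.lucene.codecs.Codec;
   import org.apache.lucene.codecs.KnnVectorsFormat;
   import org.apache.lucene.codecs.lucene99.Lucene99Codec;
   import org.apache.lucene.index.IndexWriterConfig;

   // Route one field to a hypothetical multi-vector format while every other
   // vector field keeps the default. Lucene99Codec delegates per-field lookups
   // through PerFieldKnnVectorsFormat under the hood.
   Codec codec = new Lucene99Codec() {
     final KnnVectorsFormat multiVectorFormat = new MultiVectorFormat(); // hypothetical format

     @Override
     public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
       return "colbert".equals(field) ? multiVectorFormat : super.getKnnVectorsFormatForField(field);
     }
   };
   IndexWriterConfig iwc = new IndexWriterConfig().setCodec(codec);
   ```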
   
   Using the knn codec to handle multi-vectors seems limiting, especially since it treats a document's multi-vectors as a single unit for scoring. That works well for late interaction models, where we're dealing with a bag of embeddings, but it's restrictive if we ever want to index each vector separately.
   Using the original max similarity inside HNSW is simply not practical: it doesn't scale, and I don't think it's something we'd actually want to support.
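
   For context, the max similarity in question (ColBERT-style late interaction) sums, for each query vector, its best match among the document's vectors. A minimal sketch with dot product as the per-pair similarity, to make the cost concrete:
   ```java
   // MaxSim: for each query vector take the best dot product against the
   // document's vectors, then sum those maxima. O(|Q| * |D| * dim) per scored
   // document, which is why running it inside every HNSW graph comparison
   // does not scale.
   static float maxSim(float[][] queryVectors, float[][] docVectors) {
     float score = 0f;
     for (float[] q : queryVectors) {
       float best = Float.NEGATIVE_INFINITY;
       for (float[] d : docVectors) {
         float dot = 0f;
         for (int i = 0; i < q.length; i++) {
           dot += q[i] * d[i];
         }
         best = Math.max(best, dot);
       }
       score += best;
     }
     return score;
   }
   ```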
   
   It could be helpful to explore other options instead of relying on the knn codec alone. Along those lines, I put together a quick draft of a [`LateInteractionField` using binary doc values](https://github.com/apache/lucene/compare/main...jimczi:lucene:late_interaction_field?expand=1), which keeps things simple and avoids major changes to the knn codec. I don't see the flat vector format offering any real advantage over binary doc values here: both store plain dense vectors as bytes, so there's no clear benefit to the flat format for this use case.
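
   Roughly, such a field just serializes the bag of vectors into a single binary doc values entry per document. A simplified sketch of the idea (the layout below is illustrative, not necessarily the draft's exact encoding):
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import org.apache.lucene.document.BinaryDocValuesField;
   import org.apache.lucene.util.BytesRef;

   // Pack a bag of float vectors into one binary doc values entry with the
   // layout [count][dim][count * dim floats]. Illustrative layout only.
   static BinaryDocValuesField encode(String field, float[][] vectors) {
     int dim = vectors[0].length;
     ByteBuffer buf =
         ByteBuffer.allocate(2 * Integer.BYTES + vectors.length * dim * Float.BYTES)
             .order(ByteOrder.LITTLE_ENDIAN);
     buf.putInt(vectors.length).putInt(dim);
     for (float[] v : vectors) {
       for (float f : v) {
         buf.putFloat(f);
       }
     }
     return new BinaryDocValuesField(field, new BytesRef(buf.array()));
   }
   ```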
   
   What do you think of this approach? If the main goal is just to score a bag of embeddings, we could skip the full knn framework entirely. That keeps things simpler and lets us focus on max similarity scoring without the added weight of the knn codec.
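
   Scoring then becomes a straightforward second pass over candidate documents: decode the doc values and apply max similarity, along these lines (assuming the encoding sketch above; the field name is illustrative):
   ```java
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import org.apache.lucene.index.BinaryDocValues;
   import org.apache.lucene.index.DocValues;
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.util.BytesRef;

   // Decode a document's vectors from binary doc values and score them with
   // maxSim() from the earlier sketch. No knn codec involved.
   static float rescore(LeafReaderContext ctx, int doc, float[][] queryVectors) throws IOException {
     BinaryDocValues dv = DocValues.getBinary(ctx.reader(), "late_interaction");
     if (!dv.advanceExact(doc)) {
       return 0f; // document has no vectors for this field
     }
     BytesRef raw = dv.binaryValue();
     ByteBuffer buf = ByteBuffer.wrap(raw.bytes, raw.offset, raw.length).order(ByteOrder.LITTLE_ENDIAN);
     int count = buf.getInt();
     int dim = buf.getInt();
     float[][] docVectors = new float[count][dim];
     for (int i = 0; i < count; i++) {
       for (int j = 0; j < dim; j++) {
         docVectors[i][j] = buf.getFloat();
       }
     }
     return maxSim(queryVectors, docVectors);
   }
   ```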
   
   My main worry is that wiring multi-vectors into the knn codec as a late interaction model adds complexity later. These are really two different approaches, and it seems valuable to keep the option of indexing each vector independently. We could expose that flexibility through the aggregation function, but it might complicate every codec, since each one would need to handle both the aggregated and the independent case efficiently.

