jimczi commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2445295372
The more I think about it, the less I feel like the knn codec is the best choice for this feature (assuming that this issue is focused on late interaction models).

> It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the PerFieldKnnVectorsFormat to wire different data structures on top of the flat multi-vector format. We can also provide something out of the box in a subsequent change. I think the aggregation fn. interface is also flexible enough for different types of similarity implementations?

Using the knn codec to handle multi-vectors seems limiting, especially since it treats a multi-vector as a single unit for scoring. This works well for late interaction models, where we're dealing with a collection of embeddings, but it's restrictive if we want to index each vector separately. Using the original max similarity for HNSW is simply not practical: it doesn't scale, and I don't think it's something we'd actually want to support. It could be helpful to explore other options instead of relying on the knn codec alone.

Along those lines, I put together a quick draft of a [`LateInteractionField` using binary doc values](https://github.com/apache/lucene/compare/main...jimczi:lucene:late_interaction_field?expand=1), which keeps things simple and avoids major changes to the knn codec. I don't think the flat vector format offers any real advantage over binary doc values here: in both cases we store plain dense vectors as bytes, so there doesn't seem to be a clear benefit to the flat format.

What do you think of this approach? It feels like we could skip the full knn framework if our main goal is just to score a bag of embeddings. That would keep things simpler and let us focus specifically on max similarity scoring without the added weight of the full knn codec.
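For context, the scoring the field needs is small: decode the stored bytes back into per-token vectors and, for each query vector, take the maximum dot product over the document's vectors, summing the maxima. A rough sketch of that idea (names like `maxSimScore` and `decode` are mine, not from the draft; it assumes vectors are packed as little-endian float32s with a known dimension, which may differ from the actual encoding):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MaxSimSketch {
  // Decode a doc's packed float32 multi-vector payload into numVectors x dim floats.
  static float[][] decode(byte[] payload, int dim) {
    int numVectors = payload.length / (dim * Float.BYTES);
    ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
    float[][] vectors = new float[numVectors][dim];
    for (int i = 0; i < numVectors; i++) {
      for (int j = 0; j < dim; j++) {
        vectors[i][j] = buf.getFloat();
      }
    }
    return vectors;
  }

  // Late-interaction MaxSim: for each query vector, keep the best dot product
  // against any doc vector, then sum those maxima into the doc score.
  static float maxSimScore(float[][] query, float[][] doc) {
    float score = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) {
        float dot = 0f;
        for (int k = 0; k < q.length; k++) {
          dot += q[k] * d[k];
        }
        best = Math.max(best, dot);
      }
      score += best;
    }
    return score;
  }
}
```

Note this needs nothing from the knn codec: the payload could come straight out of a `BinaryDocValues` field at rerank time.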
My main worry is that adding multi-vectors to the knn codec as a late interaction model will add complexity later. These are really two different approaches, and it seems valuable to keep the option of indexing each vector separately. We could expose that flexibility through the aggregation function, but it might complicate things across all codecs, since they'd need to handle both the aggregate and the independent case efficiently.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org