jimczi commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2445295372
The more I think about it, the less I feel like the knn codec is the best choice for this feature (assuming that this issue is focused on late interaction models).

> It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the PerFieldKnnVectorsFormat to wire different data structures on top of the flat multi-vector format. We can also provide something out of the box in a subsequent change. I think the aggregation fn. interface is also flexible enough for different types of similarity implementations?

Using the knn codec to handle multi-vectors seems limiting, especially since it treats a multi-vector as a single unit for scoring. This works well for late interaction models, where we're dealing with a collection of embeddings, but it's restrictive if we want to index each vector separately. Using the original max similarity for HNSW is simply not practical: it doesn't scale, and I don't think it's something we'd actually want to support. It could be helpful to explore other options instead of relying on the knn codec alone.

Along those lines, I put together a quick draft of a [`LateInteractionField` using binary doc values](https://github.com/apache/lucene/compare/main...jimczi:lucene:late_interaction_field?expand=1), which keeps things simple and avoids major changes to the knn codec. I don't think the flat vector format offers any real advantage over binary doc values here: in both cases we store plain dense vectors as bytes, so there doesn't seem to be a clear benefit to the flat format.

What do you think of this approach? It feels like we could skip the full knn framework if our main goal is just to score a bag of embeddings. That would keep things simpler and let us focus specifically on max similarity scoring without the added weight of the full knn codec.
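For context, the scoring the field needs is small: decode the stored bytes back into per-token vectors and, for each query vector, take the maximum dot product over the document's vectors, summing the maxima. A rough sketch of that idea (names like `maxSimScore` and `decode` are mine, not from the draft; it assumes vectors are packed as little-endian float32s with a known dimension, which may differ from the actual encoding):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MaxSimSketch {
  // Decode a doc's packed float32 multi-vector payload into numVectors x dim floats.
  static float[][] decode(byte[] payload, int dim) {
    int numVectors = payload.length / (dim * Float.BYTES);
    ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
    float[][] vectors = new float[numVectors][dim];
    for (int i = 0; i < numVectors; i++) {
      for (int j = 0; j < dim; j++) {
        vectors[i][j] = buf.getFloat();
      }
    }
    return vectors;
  }

  // Late-interaction MaxSim: for each query vector, keep the best dot product
  // against any doc vector, then sum those maxima into the doc score.
  static float maxSimScore(float[][] query, float[][] doc) {
    float score = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) {
        float dot = 0f;
        for (int k = 0; k < q.length; k++) {
          dot += q[k] * d[k];
        }
        best = Math.max(best, dot);
      }
      score += best;
    }
    return score;
  }
}
```

Note this needs nothing from the knn codec: the payload could come straight out of a `BinaryDocValues` field at rerank time.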
My main worry is that adding multi-vectors to the knn codec as a late interaction model will add complexity later. These are really two different approaches, and it seems valuable to keep the option of indexing each vector separately. We could expose that flexibility through the aggregation function, but it might complicate things across all codecs, since they'd need to handle both the aggregate and the independent case efficiently.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org