benwtrent commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-2045604262
I do think things like `ColBERT` would benefit from having multiple vectors for a single document field.

One crazy idea I had (others have probably already thought of this and found it wanting...): since HNSW already supports non-Euclidean spaces, what if HNSW graph nodes simply represented more than one vector? Then the flat storage system and underlying scorer could handle the distance computations, and HNSW itself wouldn't actually have to change. I could see this maybe hurting recall, but I wonder how badly it would actually hurt in practice.

The idea would be:

- A new FlatVectorFormat type that allows more than one vector (or possibly extending the existing ones).
- That type would provide a scorer to HNSW that resolves the multi-vector scores by applying a particular aggregation to the scores of the individual vectors. This could be "max", "min", "avg", "sum", or something similar.
- Then we need to test the graph's recall for individual vectors, as a query could be one vector (regular passage search) or multiple (ColBERT).

HNSW doesn't actually look at the vectors at all; it simply provides an ordinal and requests a score, so I think the code change wouldn't be too bad.
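
To make the scorer idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: `MultiVectorScorer`, `Aggregation`, and the in-memory `List<float[][]>` storage are all hypothetical names invented for illustration, and a dot product stands in for whatever similarity the field would really use. Summing over query vectors for the multi-vector case is also an assumption (with `MAX` it matches ColBERT-style late interaction); the comment above does not specify it.

```java
import java.util.List;

/** Hypothetical sketch: each graph ordinal owns several vectors instead of one. */
final class MultiVectorScorer {

    /** How the per-vector similarities are collapsed into a single node score. */
    enum Aggregation { MAX, MIN, AVG, SUM }

    // ordinal -> all vectors stored for that document field (assumed non-empty)
    private final List<float[][]> vectorsByOrdinal;
    private final Aggregation aggregation;

    public MultiVectorScorer(List<float[][]> vectorsByOrdinal, Aggregation aggregation) {
        this.vectorsByOrdinal = vectorsByOrdinal;
        this.aggregation = aggregation;
    }

    /** Plain dot product, standing in for the field's real similarity function. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Scores a single query vector against every vector stored under the ordinal. */
    public float score(float[] query, int ordinal) {
        float[][] docVectors = vectorsByOrdinal.get(ordinal);
        if (docVectors.length == 0) {
            throw new IllegalArgumentException("no vectors for ordinal " + ordinal);
        }
        float max = Float.NEGATIVE_INFINITY;
        float min = Float.POSITIVE_INFINITY;
        float sum = 0f;
        for (float[] v : docVectors) {
            float s = dot(query, v);
            max = Math.max(max, s);
            min = Math.min(min, s);
            sum += s;
        }
        switch (aggregation) {
            case MAX: return max;
            case MIN: return min;
            case AVG: return sum / docVectors.length;
            default:  return sum; // SUM
        }
    }

    /** Multi-vector query: sums the aggregated node score of each query vector (an assumption). */
    public float score(float[][] queryVectors, int ordinal) {
        float total = 0f;
        for (float[] q : queryVectors) {
            total += score(q, ordinal);
        }
        return total;
    }
}
```

The graph side would still only hand over an ordinal and receive a float, which is the property the idea relies on. Whether graphs built with such an aggregated score keep acceptable recall is exactly the open question that would need testing.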