benwtrent commented on issue #12313:
URL: https://github.com/apache/lucene/issues/12313#issuecomment-2045604262

   I do think things like `ColBERT` would benefit from having multiple vectors 
for a single document field.
   
   One crazy idea I had (others have probably already thought of this and 
found it wanting...): since HNSW already supports non-Euclidean spaces, what 
if each HNSW graph node simply represented more than one vector?
   
   Then the flat storage system and the underlying scorer could handle the 
distance computations, and HNSW itself wouldn't actually have to change.
   
   I could see this maybe hurting recall, but I wonder how badly it would 
actually hurt in practice.
   
   The idea would be:
   
    - A new FlatVectorFormat type that allows more than one vector (or possibly 
extending the existing ones)
    - That type would provide a scorer to HNSW that resolves multi-vector 
scores by aggregating the scores of the individual vectors. The aggregation 
could be "max", "min", "avg", "sum" or something similar.
    - Then we would need to test the graph's recall for individual vectors, 
since a query could be one vector (regular passage search) or multiple (ColBERT).
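
   To make the aggregation step concrete, here is a minimal Java sketch. 
Everything in it is hypothetical and for illustration only: `MultiVectorScorer`, 
the `Aggregation` enum, and the use of a dot product as the per-vector 
similarity are assumptions, not existing Lucene classes or APIs. For a 
multi-vector query it sums the aggregated score of each query vector, which 
with "max" corresponds to ColBERT-style late interaction.

```java
import java.util.List;

/** Hypothetical scorer over graph nodes that each hold several vectors (not a Lucene class). */
final class MultiVectorScorer {
  /** Hypothetical aggregation modes, mirroring "max", "min", "avg", "sum". */
  enum Aggregation { MAX, MIN, AVG, SUM }

  private final List<float[][]> nodes; // nodes.get(ordinal) = all vectors of that node
  private final Aggregation aggregation;

  MultiVectorScorer(List<float[][]> nodes, Aggregation aggregation) {
    this.nodes = nodes;
    this.aggregation = aggregation;
  }

  /** Dot product, used here as the per-vector similarity (an assumption). */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Scores one node against a single query vector by aggregating per-vector scores. */
  float score(float[] query, int ordinal) {
    float[][] vectors = nodes.get(ordinal);
    float max = Float.NEGATIVE_INFINITY;
    float min = Float.POSITIVE_INFINITY;
    float sum = 0f;
    for (float[] v : vectors) {
      float s = dot(query, v);
      max = Math.max(max, s);
      min = Math.min(min, s);
      sum += s;
    }
    switch (aggregation) {
      case MAX: return max;
      case MIN: return min;
      case AVG: return sum / vectors.length;
      default: return sum; // SUM
    }
  }

  /** Multi-vector query: sums the aggregated score of each query vector. */
  float score(float[][] queries, int ordinal) {
    float total = 0f;
    for (float[] q : queries) {
      total += score(q, ordinal);
    }
    return total;
  }

  public static void main(String[] args) {
    List<float[][]> nodes = List.of(
        new float[][] {{1f, 0f}, {0f, 1f}}, // node 0 holds two vectors
        new float[][] {{0.5f, 0.5f}});      // node 1 holds one vector
    MultiVectorScorer scorer = new MultiVectorScorer(nodes, Aggregation.MAX);
    System.out.println(scorer.score(new float[] {2f, 3f}, 0));               // prints 3.0
    System.out.println(scorer.score(new float[][] {{1f, 0f}, {0f, 1f}}, 0)); // prints 2.0
  }
}
```

   Which aggregation keeps graph recall acceptable is exactly the open 
question; the sketch only shows that the choice can live entirely in the scorer.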
   
   HNSW doesn't actually look at the vectors at all; it simply provides an 
ordinal and requests a score, so I think the code change wouldn't be too bad.
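
   As a rough illustration of that separation, the sketch below walks a graph 
using nothing but an ordinal-to-score function. It is a deliberately simplified 
greedy walk (one candidate at a time, single layer), not Lucene's actual HNSW 
searcher, and `GreedyGraphWalk`, the adjacency-array graph, and the max-of-dot-
products scorer in `main` are all hypothetical names and choices made for this 
example.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical, simplified greedy walk; the graph code never sees a vector. */
final class GreedyGraphWalk {
  /** Returns the ordinal where the walk stops, i.e. no neighbor scores higher. */
  static int search(int[][] neighbors, int entryPoint, IntToDoubleFunction scorer) {
    int current = entryPoint;
    double best = scorer.applyAsDouble(current);
    while (true) {
      int next = current;
      for (int n : neighbors[current]) {
        double s = scorer.applyAsDouble(n); // only an ordinal goes in, a score comes out
        if (s > best) {
          best = s;
          next = n;
        }
      }
      if (next == current) {
        return current; // local maximum reached
      }
      current = next;
    }
  }

  public static void main(String[] args) {
    // Each node holds one or more vectors; the graph is a simple chain 0-1-2-3.
    float[][][] nodeVectors = {
        {{1f, 0f}},
        {{0.8f, 0.2f}, {0f, 1f}},
        {{0f, 1f}, {0.6f, 0.8f}},
        {{-1f, 0f}}};
    int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2}};
    float[] query = {0.6f, 0.8f};

    // The scorer hides the multi-vector detail: here, max dot product per node.
    IntToDoubleFunction scorer = ordinal -> {
      double max = Double.NEGATIVE_INFINITY;
      for (float[] v : nodeVectors[ordinal]) {
        double dot = 0;
        for (int i = 0; i < v.length; i++) {
          dot += v[i] * query[i];
        }
        max = Math.max(max, dot);
      }
      return max;
    };
    System.out.println(search(neighbors, 0, scorer)); // prints 2
  }
}
```

   Swapping the lambda for a single-vector scorer would not touch `search` at 
all, which is the property the idea relies on.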


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

