Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

via GitHub Wed, 20 Nov 2024 04:05:49 -0800


krickert commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2488410934


   > And we can use getAllVectorValues() for scoring with max or avg of all 
vectors in the doc at query time.
   
   Your proposal to implement `getAllVectorValues()` for scoring documents by 
aggregating their vectors (using methods like max or average) at query time has 
a lot of use cases, and I think it's a great idea. In my domain-specific 
data, though, this approach hasn't improved search results. Still, providing a 
default implementation, as you suggested, with the option for customization, 
could be beneficial.
   
   (Side note: if you are doing max/average, you can do that at index time, 
though, right?)
   
   I'm currently conducting A/B tests on three methods to retrieve and rank 
documents with multiple vectors:
   
   1. **Aggregate Scoring:** Computing a single relevance score per document by 
aggregating all its vectors. Flexibility in the aggregation method would help 
me a lot.
   2. **Chunk-Based Highlighting:** Treating each vector as a distinct document 
chunk to facilitate highlighting. This involves returning the top N documents 
ranked by aggregate score, with each document potentially containing multiple 
relevant sections. K would then be more dynamic: we want the top N documents, 
and K would be the number of chunks that represent those documents. 
Implementing per-document thresholds can help manage performance.
   3. **Custom Aggregation with Embedding Tags:** Associating vectors with 
specific tags, such as user access levels or n-gram embeddings, to enable 
dynamic aggregation strategies. This allows for personalized and 
context-sensitive relevance scoring and would require the ability to 
override/customize.
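
   To make the second approach concrete, here is a minimal sketch of grouping chunk-level hits into the top N documents. Everything in it is hypothetical (`ChunkGrouping`, `ChunkHit`, `DocResult`, `topDocs`); none of it is Lucene API or code from this PR. It also assumes the per-doc threshold is a simple cap on chunks kept per document and that a document is ranked by its best chunk score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Lucene API: groups chunk hits into per-document results. */
final class ChunkGrouping {

    /** One matching chunk (vector) of a document, with its similarity score. */
    static final class ChunkHit {
        final int docId;
        final int chunk;
        final float score;

        ChunkHit(int docId, int chunk, float score) {
            this.docId = docId;
            this.chunk = chunk;
            this.score = score;
        }
    }

    /** A document, its aggregate score, and the chunks kept for highlighting. */
    static final class DocResult {
        final int docId;
        final float score;
        final List<ChunkHit> chunks;

        DocResult(int docId, float score, List<ChunkHit> chunks) {
            this.docId = docId;
            this.score = score;
            this.chunks = chunks;
        }
    }

    /** Returns the top N documents, each with at most maxChunksPerDoc best chunks. */
    static List<DocResult> topDocs(List<ChunkHit> hits, int topN, int maxChunksPerDoc) {
        if (maxChunksPerDoc < 1) {
            throw new IllegalArgumentException("maxChunksPerDoc must be >= 1");
        }
        // Group the chunk hits by the document they belong to.
        Map<Integer, List<ChunkHit>> byDoc = new LinkedHashMap<>();
        for (ChunkHit h : hits) {
            byDoc.computeIfAbsent(h.docId, k -> new ArrayList<>()).add(h);
        }
        List<DocResult> results = new ArrayList<>();
        for (Map.Entry<Integer, List<ChunkHit>> e : byDoc.entrySet()) {
            List<ChunkHit> chunks = e.getValue();
            // Best chunk first, then apply the per-document cap.
            chunks.sort(Comparator.comparingDouble((ChunkHit h) -> h.score).reversed());
            List<ChunkHit> kept =
                new ArrayList<>(chunks.subList(0, Math.min(maxChunksPerDoc, chunks.size())));
            // Assumption: the document score is the score of its best chunk (max).
            results.add(new DocResult(e.getKey(), kept.get(0).score, kept));
        }
        results.sort(Comparator.comparingDouble((DocResult d) -> d.score).reversed());
        return new ArrayList<>(results.subList(0, Math.min(topN, results.size())));
    }
}
```

   Here the number of chunks returned is not fixed up front: it falls out of N and the per-document cap, which is what I mean by K being dynamic.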
   
   The third approach is particularly promising for domain-specific 
applications, where standard aggregation methods may not suffice. For instance, 
embedding tags could be linked to user access controls, unlocking certain 
vectors at query time, or to specific n-grams, activating them based on query 
content.
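
   As a rough illustration of the access-control case, the sketch below scores a document using only the vectors whose tag is unlocked for the current query. The names (`TaggedVectorScoring`, `TaggedVector`, `maxOverAllowedTags`) are made up for this example and are not part of Lucene or this PR; it also assumes dot-product similarity and max aggregation.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not Lucene API: aggregate only over vectors with allowed tags. */
final class TaggedVectorScoring {

    /** A document vector plus the tag that gates its use at query time. */
    static final class TaggedVector {
        final String tag;
        final float[] vector;

        TaggedVector(String tag, float[] vector) {
            this.tag = tag;
            this.vector = vector;
        }
    }

    /** Plain dot product; assumes both arrays have the same length. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /**
     * Max similarity over the vectors whose tag is in allowedTags.
     * Returns Float.NEGATIVE_INFINITY when no vector is unlocked.
     */
    static float maxOverAllowedTags(
            float[] query, List<TaggedVector> vectors, Set<String> allowedTags) {
        float best = Float.NEGATIVE_INFINITY;
        for (TaggedVector tv : vectors) {
            if (allowedTags.contains(tv.tag)) {
                best = Math.max(best, dot(query, tv.vector));
            }
        }
        return best;
    }
}
```

   The same document can therefore score differently for two users, depending on which tags their query unlocks.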
   
   Incorporating a mechanism to override the default aggregation method would 
facilitate experimentation with these strategies.
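
   Something along these lines is what I have in mind for the override: a small aggregation interface with max and average as defaults. This is only a sketch under my own assumptions; `PluggableAggregation`, `Aggregator`, and `scoreDocument` are hypothetical names, not the API proposed in this PR, and dot product stands in for whatever similarity the field uses.

```java
import java.util.List;

/** Hypothetical sketch, not Lucene API: document scoring with a swappable aggregator. */
final class PluggableAggregation {

    /** Combines the per-vector similarity scores of one document into one score. */
    interface Aggregator {
        float aggregate(float[] scores);
    }

    /** Default: best-matching vector wins. Yields NEGATIVE_INFINITY for no scores. */
    static final Aggregator MAX = scores -> {
        float best = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            best = Math.max(best, s);
        }
        return best;
    };

    /** Default: mean of all per-vector scores. Yields 0 for no scores. */
    static final Aggregator AVG = scores -> {
        if (scores.length == 0) {
            return 0f;
        }
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum / scores.length;
    };

    /** Plain dot product; assumes both arrays have the same length. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Scores every vector of the document against the query, then aggregates. */
    static float scoreDocument(float[] query, List<float[]> docVectors, Aggregator agg) {
        float[] scores = new float[docVectors.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = dot(query, docVectors.get(i));
        }
        return agg.aggregate(scores);
    }
}
```

   A custom strategy is then just another `Aggregator` lambda passed to `scoreDocument`, which keeps experiments cheap.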
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


