krickert commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2488410934
> And we can use getAllVectorValues() for scoring with max or avg of all vectors in the doc at query time.

Your proposal to implement `getAllVectorValues()` for scoring documents by aggregating their vectors at query time (using methods like max or average) has a lot of use cases, and I think it's a great idea. In my domain-specific data, this approach hasn't improved search results. Still, providing a default implementation with the option for customization, as you suggested, could be beneficial. (Side note: if you are doing max/average, you can do that at index time though, right?)

I'm currently running A/B tests on three methods to retrieve and rank documents with multiple vectors:

1. **Aggregate Scoring:** Computing a single relevance score per document by aggregating all its vectors. Flexibility in the aggregation method would help me a lot.
2. **Chunk-Based Highlighting:** Treating each vector as a distinct document chunk to facilitate highlighting. This involves returning the top N documents by aggregate score, with each document potentially containing multiple relevant sections. K becomes more dynamic here: we want the top N documents, and K is the number of chunks that represent those documents. Per-document thresholds can help manage performance.
3. **Custom Aggregation with Embedding Tags:** Associating vectors with specific tags, such as user access levels or n-gram embeddings, to enable dynamic aggregation strategies. This allows for personalized and context-sensitive relevance scoring, and would require the ability to override or customize the aggregation.

The third approach is particularly promising for domain-specific applications, where standard aggregation methods may not suffice. For instance, embedding tags could be linked to user access controls, unlocking certain vectors at query time, or to specific n-grams, activating them based on query content.
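
To make the idea concrete, here is a rough sketch of what an overridable aggregation hook could look like. It is plain Java and not Lucene API: `ScoreAggregator`, `TaggedVector`, and `scoreDocument` are hypothetical names, and a dot product stands in for whatever similarity function the field actually uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these types exist in Lucene.
final class MultiVectorScoring {

    /** Hypothetical hook: collapses per-vector scores into one document score. */
    interface ScoreAggregator {
        float aggregate(float[] scores);
    }

    /** Default candidate: the best-matching vector decides the document score. */
    static final ScoreAggregator MAX = scores -> {
        float best = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            best = Math.max(best, s);
        }
        return best;
    };

    /** Default candidate: mean of all per-vector scores. */
    static final ScoreAggregator AVG = scores -> {
        if (scores.length == 0) {
            return Float.NEGATIVE_INFINITY; // no usable vectors: rank last
        }
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum / scores.length;
    };

    /** A document vector plus an assumed tag, e.g. an access level. */
    static final class TaggedVector {
        final String tag;
        final float[] vector;

        TaggedVector(String tag, float[] vector) {
            this.tag = tag;
            this.vector = vector;
        }
    }

    /** Dot product, used here as a stand-in similarity. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Scores only the vectors whose tag is enabled, then aggregates. */
    static float scoreDocument(float[] query, List<TaggedVector> docVectors,
                               Set<String> enabledTags, ScoreAggregator aggregator) {
        float[] scores = new float[docVectors.size()];
        int n = 0;
        for (TaggedVector tv : docVectors) {
            if (enabledTags.contains(tv.tag)) {
                scores[n++] = dot(query, tv.vector);
            }
        }
        return aggregator.aggregate(Arrays.copyOf(scores, n));
    }
}
```

Passing `MAX` or `AVG` covers the default cases, while a caller experimenting with the tag-based approach can supply its own `ScoreAggregator` and its own set of enabled tags per query.
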
Incorporating a mechanism to override the default aggregation method would facilitate experimentation with these strategies.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org