[GitHub] [lucene] alessandrobenedetti opened a new pull request, #12314: Multi-value support for KnnVectorField

via GitHub Fri, 19 May 2023 08:47:49 -0700


alessandrobenedetti opened a new pull request, #12314:
URL: https://github.com/apache/lucene/pull/12314


   ### Description
   This pull request aims to introduce support for multiple values in a single 
Knn vector field.
   The adopted solution relies on:
   **Index time**
   A sparse vector values approach, where an ordinal (vectorId) to docId map 
keeps the relation between a docId and all of its vectors.
   In the current sparse vector approach, there is just one vectorId per docId.
   In this proposed contribution, multiple vectorIds are mapped to the same 
docId.
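
   The index-time mapping can be pictured with a minimal sketch. It is not the 
code from this pull request: the class `OrdToDocMapSketch` and its methods are 
hypothetical names, and a plain in-memory list stands in for whatever 
structure the codec actually writes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not a Lucene class: maps each vector ordinal to its docId. */
final class OrdToDocMapSketch {
    // Index = ordinal (vectorId), value = docId that owns that vector.
    private final List<Integer> ordToDoc = new ArrayList<>();

    /** Registers one more vector for docId and returns the ordinal assigned to it. */
    int addVector(int docId) {
        ordToDoc.add(docId);
        return ordToDoc.size() - 1;
    }

    /** Returns the docId that owns the given ordinal. */
    int docIdOf(int ordinal) {
        return ordToDoc.get(ordinal);
    }

    /** Returns every ordinal mapped to docId, in insertion order (linear scan for clarity). */
    List<Integer> ordinalsOf(int docId) {
        List<Integer> ordinals = new ArrayList<>();
        for (int ord = 0; ord < ordToDoc.size(); ord++) {
            if (ordToDoc.get(ord) == docId) {
                ordinals.add(ord);
            }
        }
        return ordinals;
    }
}
```

   In the single-valued case each docId appears at most once in such a map; 
the multi-valued case only relaxes that constraint, so several consecutive 
ordinals can point to the same docId.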
   **Query time**
   The user chooses a multi-valued scoring strategy: MAX or SUM.
   In exact nearest neighbor search, for each document accepted by the 
query/filter:
   MAX = the similarity score between the query and each of the document's 
vectors is computed, and the max score is used for the search result
   SUM = the similarity score between the query and each of the document's 
vectors is computed, and all scores are summed to get the final score
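
   As a rough illustration of the exact case, the sketch below scores one 
document from all of its vectors. The names `ExactMultiValueScoreSketch`, 
`Strategy`, `dot` and `score` are hypothetical, and a plain dot product is 
assumed as the similarity; the actual similarity function depends on the field 
configuration.

```java
/** Hypothetical sketch, not Lucene code: exact scoring of one multi-valued document. */
final class ExactMultiValueScoreSketch {
    enum Strategy { MAX, SUM }

    /** Plain dot product, an assumed stand-in for the field's similarity function. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Scores the query against every vector of one document and combines the scores. */
    static float score(float[] query, float[][] docVectors, Strategy strategy) {
        if (docVectors.length == 0) {
            throw new IllegalArgumentException("document has no vectors");
        }
        float result = dot(query, docVectors[0]);
        for (int i = 1; i < docVectors.length; i++) {
            float s = dot(query, docVectors[i]);
            // MAX keeps the best single vector; SUM accumulates every vector's score.
            result = (strategy == Strategy.MAX) ? Math.max(result, s) : result + s;
        }
        return result;
    }
}
```

   Note that SUM favors documents with many vectors, while MAX ranks a document 
by its single closest vector.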
   
   In approximate nearest neighbor search, for each document accepted by the 
query/filter:
   MAX = every time a nearest neighbor vector is about to be added to the topK, 
if its document is already there, the document's score is updated to the 
maximum of the existing score and the new score
   SUM = every time a nearest neighbor vector is about to be added to the topK, 
if its document is already there, the document's score is updated to the sum 
of the existing score and the new score
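
   The update rule for the approximate case can be sketched as follows. This is 
only an illustration under simplifying assumptions: `MultiValueCollectorSketch` 
and its methods are hypothetical names, and an unbounded map plus a sort stands 
in for the modifiable heap and neighbor queue of the actual change.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

/** Hypothetical sketch, not Lucene code: per-document score updates during graph search. */
final class MultiValueCollectorSketch {
    enum Strategy { MAX, SUM }

    private final Map<Integer, Float> scoreByDoc = new HashMap<>();
    private final BinaryOperator<Float> combine;

    MultiValueCollectorSketch(Strategy strategy) {
        // MAX keeps the larger of old and new score; SUM adds them.
        this.combine = (strategy == Strategy.MAX) ? Float::max : Float::sum;
    }

    /** Called for each neighbor vector found; docId is the document owning that vector. */
    void collect(int docId, float score) {
        // First vector of a document inserts its score; later ones update it in place.
        scoreByDoc.merge(docId, score, combine);
    }

    /** Returns the current score of a document that has already been collected. */
    float scoreOf(int docId) {
        return scoreByDoc.get(docId);
    }

    /** Returns up to k docIds ordered by descending score (a sort replaces the heap here). */
    List<Integer> topK(int k) {
        return scoreByDoc.entrySet().stream()
            .sorted(Map.Entry.<Integer, Float>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

   Unlike the exact case, only the vectors actually visited by the search 
contribute here, so a SUM score computed this way may be lower than the exact 
sum over all of a document's vectors.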
   
   N.B. This pull request is not meant to be merged at this stage.
   At least the following activities remain before this draft can move to 
a 'production ready' version:
   
   1) validate the overall idea and approach
   2) validate index time usage of sparse vector values for the multi-valued 
use case
   3) validate merge policy for the multi-valued use case
   4) validate query time MAX/SUM approach
   5) validate query time modifiable heap and neighborQueue usage
   6) validate regressions
   7) introduce more tests
   
   It's a big contribution and it will take time and effort to complete.
   Any help is welcome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


