alessandrobenedetti opened a new pull request, #12314: URL: https://github.com/apache/lucene/pull/12314
### Description

This pull request aims to introduce support for multiple values in a single Knn vector field. The adopted solution relies on:

**Index time**

The sparse vector values approach, where an Ordinal (vectorId) to DocId map keeps the relation between a DocId and all its vectors.
In the current sparse vector approach, there is just one vectorId per docID.
In this proposed contribution, multiple vectorIds are mapped to the same docID.

**Query time**

A multi-valued strategy choice is offered to the user: MAX/SUM.

In exact nearest neighbor, for each document accepted by the query/filter:
- MAX = the similarity score between the query and each vector is computed, and the max score is chosen for the search result
- SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score

In approximate nearest neighbor, for each document accepted by the query/filter:
- MAX = every time we find a nearest neighbor vector to be added to the topK, if the document is already there, its score is updated to the maximum of its previous score and the new score
- SUM = every time we find a nearest neighbor vector to be added to the topK, if the document is already there, its score is updated to the sum of its previous score and the new score

N.B. This Pull Request is not meant to be ready to be merged at this stage.
I can identify at least this set of activities before this draft can move to a 'production ready' version:

1) validate the overall idea and approach
2) validate index time usage of sparse vector values for the multi-valued use case
3) validate merge policy for the multi-valued use case
4) validate query time MAX/SUM approach
5) validate query time modifiable heap and neighborQueue usage
6) validate regressions
7) introduce more tests

It's a big contribution and it will take time and effort to complete. Any help is welcome.

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
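
For readers following the MAX/SUM discussion in the description above, here is a minimal sketch of how per-vector scores could be folded into per-document scores through an ordinal-to-docId map. It is not code from the pull request and does not use Lucene APIs: the class `MultiValuedKnnSketch`, the `Strategy` enum, the `ordToDoc` array and the dot-product similarity are all hypothetical names and choices made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class MultiValuedKnnSketch {

  /** Assumed strategy enum mirroring the MAX/SUM choice in the description. */
  enum Strategy { MAX, SUM }

  /** Dot product, used here as a stand-in similarity function. */
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Exact scoring: compute the similarity for every vector, then fold the
   * scores per document. ordToDoc[ord] is the docId that owns vector ord,
   * so several ordinals may point at the same document.
   */
  static Map<Integer, Float> scoreDocs(
      float[] query, float[][] vectors, int[] ordToDoc, Strategy strategy) {
    Map<Integer, Float> docScores = new HashMap<>();
    for (int ord = 0; ord < vectors.length; ord++) {
      float score = dotProduct(query, vectors[ord]);
      int doc = ordToDoc[ord];
      Float previous = docScores.get(doc);
      if (previous == null) {
        // First vector seen for this document.
        docScores.put(doc, score);
      } else if (strategy == Strategy.MAX) {
        // Document already present: keep the best score.
        docScores.put(doc, Math.max(previous, score));
      } else {
        // Document already present: accumulate the scores.
        docScores.put(doc, previous + score);
      }
    }
    return docScores;
  }
}
```

The same "update if the document is already present" step is what the description proposes for the approximate case, applied to the topK collection rather than to a plain map as assumed here.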