benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1651725953
@msokolov && @alessandrobenedetti pinging y'all as you will probably be the most interested in this change.

@alessandrobenedetti the original design did take some inspiration from your multi-value vector work. However, benchmarking & testing required significant changes. For deduplicating parent docIds during search, the hash map is now part of the queue instead of iterating a cache outside the heap. This improved performance significantly.

I would say this is how folks should represent multi-valued vectors when they require access to the matching passage or additional metadata. Otherwise, deep changes are required in the codec to attach arbitrary metadata to the vectors themselves, which seems like overkill to me when we already have `join`.

This does not obviate the need for "true" multi-value vector support (e.g. for late-interaction models, or multi-value vectors that don't require metadata). This does lay some nice groundwork that can improve that implementation (a custom collector that can deduplicate vectors to a docId while searching).
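To make the deduplication idea above concrete, here is a minimal sketch of a bounded top-k queue that keeps at most one entry per parent docId by holding a hash map inside the queue itself. It is not the code from the PR: the class `ParentDedupQueue`, its methods, and the use of a `TreeSet` in place of a hand-rolled heap are all assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: top-k queue holding at most one (best) score per parent docId. */
public class ParentDedupQueue {
  private static final class Entry {
    final int parent;
    final float score;
    Entry(int parent, float score) { this.parent = parent; this.score = score; }
  }

  private final int k;
  // Best score for each parent currently held in the queue (the "hash map inside the queue").
  private final Map<Integer, Float> bestByParent = new HashMap<>();
  // Ordered by ascending score, ties broken by parent id, so first() is the worst entry.
  private final TreeSet<Entry> queue = new TreeSet<>(
      Comparator.comparingDouble((Entry e) -> e.score).thenComparingInt(e -> e.parent));

  public ParentDedupQueue(int k) { this.k = k; }

  /** Offers a scored child vector for the given parent; returns true if the queue changed. */
  public boolean offer(int parent, float score) {
    Float current = bestByParent.get(parent);
    if (current != null) {
      // Parent already collected: keep only its best-scoring child.
      if (score <= current) return false;
      queue.remove(new Entry(parent, current));
    } else if (queue.size() >= k) {
      // Queue is full: the new parent must beat the current worst entry.
      Entry worst = queue.first();
      if (score <= worst.score) return false;
      queue.pollFirst();
      bestByParent.remove(worst.parent);
    }
    queue.add(new Entry(parent, score));
    bestByParent.put(parent, score);
    return true;
  }

  public int size() { return queue.size(); }

  /** Returns the best score held for this parent, or null if the parent is not in the queue. */
  public Float scoreOf(int parent) { return bestByParent.get(parent); }
}
```

Because the map lookup happens on every `offer`, a second child of an already-collected parent either updates that parent's entry in place or is dropped, so no separate pass over an external cache is needed once the search finishes.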