benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1656874923
Thanks for digging in @msokolov!

> I'd like to have a clearer sense of the problem you're solving.

This PR solves a problem similar to, but distinct from, https://github.com/apache/lucene/issues/12313

Text embedding models have a token limit, so larger text inputs need to be divided into passages before they are embedded. These passages share a common parent document. When users search for the "top-k" documents, they expect the parent documents as results, not just individual passages.

A "multi-value" vector only partially solves this. The user will also need to know the nearest passages across those documents to use in retrieval augmented generation. Multi-value fields cannot solve this, as metadata needs to be associated with each vector to tie it to an originating passage, or better yet, to some field containing the passage text itself.

`join` seemed like a natural place to tackle this.

- We get the top-k parent documents (e.g. the user's larger chunk of text)
- And can still get the nearest passages from that deduplicated set of parent documents

Not to mention the nice flexibility we get (filtering on passage metadata, filtering on parent documents, hybrid scoring on the parent or child level, etc.)

> What confuses me is, I would have expected something like `ToParentBlockJoinQuery(parentBitSet, KnnVectorQuery())` to more or less work already.

The main issue is that it won't return the correct number of parent documents when the user requests the top-k parents based on their children's vectors. The kNN query returns the k nearest children, so if several of those children belong to the same parent, joining them up to their parents may return fewer than k parent documents.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
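The shortfall described in the last paragraph of the comment can be illustrated with a small plain-Java sketch. It does not use Lucene and is not the PR's implementation: `ParentTopK`, `Hit`, `childTopKThenJoin`, and `parentAwareTopK` are hypothetical names, and the scores are made-up toy data. It only contrasts "take the k best children, then collapse to parents" with "keep collecting until k distinct parents are found".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, no Lucene classes involved.
class ParentTopK {
    /** A child passage hit: its id, the id of its parent document, and a similarity score. */
    static final class Hit {
        final int childId;
        final int parentId;
        final float score;

        Hit(int childId, int parentId, float score) {
            this.childId = childId;
            this.parentId = parentId;
            this.score = score;
        }
    }

    private static List<Hit> byScoreDescending(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        return sorted;
    }

    /** Take the k best children first, then collapse to parents. May yield fewer than k parents. */
    static Map<Integer, Float> childTopKThenJoin(List<Hit> hits, int k) {
        List<Hit> sorted = byScoreDescending(hits);
        Map<Integer, Float> parents = new LinkedHashMap<>();
        for (Hit h : sorted.subList(0, Math.min(k, sorted.size()))) {
            // Children arrive best-first, so the first score seen per parent is its best child.
            parents.putIfAbsent(h.parentId, h.score);
        }
        return parents;
    }

    /** Walk children best-first and stop only once k distinct parents have been collected. */
    static Map<Integer, Float> parentAwareTopK(List<Hit> hits, int k) {
        Map<Integer, Float> parents = new LinkedHashMap<>();
        for (Hit h : byScoreDescending(hits)) {
            parents.putIfAbsent(h.parentId, h.score);
            if (parents.size() == k) {
                break;
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        // Parent 100 owns the three best passages; parents 200 and 300 own one each.
        List<Hit> hits = List.of(
                new Hit(0, 100, 0.95f),
                new Hit(1, 100, 0.90f),
                new Hit(2, 100, 0.85f),
                new Hit(3, 200, 0.80f),
                new Hit(4, 300, 0.70f));

        System.out.println(childTopKThenJoin(hits, 3).keySet()); // [100]
        System.out.println(parentAwareTopK(hits, 3).keySet());   // [100, 200, 300]
    }
}
```

With k = 3, the first approach returns a single parent because all three nearest children share parent 100, while the parent-aware loop keeps going until it has three distinct parents. A real implementation would have to do this deduplication during the vector search itself rather than over a fully sorted list, which this sketch does not attempt.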
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org