benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1656874923

   Thanks for digging in @msokolov!
   
   > I'd like to have a clearer sense of the problem you're solving.
   
   This PR solves a problem that is similar to, but distinct from: 
https://github.com/apache/lucene/issues/12313 
   
   Text embedding models have a token limit, so larger text inputs need to be 
divided into passages before they are embedded. These passages share a common 
parent document. When users search for the "top-k" documents, they expect the 
original parent documents as the result, not just individual passages.
   
   A "multi-value" vector only partially solves this. The user also needs to 
know the nearest passages across those documents to use in retrieval-augmented 
generation. Multi-value fields cannot provide this, because metadata needs to 
be associated with each vector to tie it to an originating passage, or better 
yet, to some field containing the passage text itself. `join` seemed like a 
natural place to tackle this.
   
    - We get the top-k parent documents (e.g. the user's larger chunk of text)
    - And can still get the nearest passages from that deduplicated set of 
parent documents
   
   Not to mention the nice flexibility we get (filtering on passage metadata, 
filtering on parent documents, hybrid scoring on the parent or child level, 
etc.)
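
   To make the two bullets above concrete, here is a minimal brute-force 
sketch in plain Python. It is not the Lucene API and not the code in this PR: 
`PassageHit`, `HITS`, the `lang` metadata field and 
`nearest_passage_per_parent` are all hypothetical names, and the scores are 
made-up similarities. It only illustrates the intended result shape: k 
distinct parents, each carrying its nearest passage, with an optional filter 
on passage metadata.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PassageHit:
    """One child (passage) vector hit. All fields are illustrative."""
    parent_id: str   # id of the parent document the passage came from
    text: str        # passage text stored alongside the vector
    lang: str        # hypothetical passage-level metadata
    score: float     # similarity to the query; higher means nearer


# Made-up hits, as if every passage had already been scored against a query.
HITS: List[PassageHit] = [
    PassageHit("doc-1", "HNSW builds a layered graph.", "en", 0.91),
    PassageHit("doc-1", "Each layer is sparser than the last.", "en", 0.88),
    PassageHit("doc-2", "HNSW construit un graphe.", "fr", 0.86),
    PassageHit("doc-3", "Vectors are quantized to bytes.", "en", 0.60),
]


def nearest_passage_per_parent(
    hits: List[PassageHit],
    k: int,
    passage_filter: Callable[[PassageHit], bool] = lambda hit: True,
) -> Dict[str, PassageHit]:
    """Return up to k distinct parents, each mapped to its nearest passage.

    Passages are visited from nearest to farthest; the first accepted passage
    seen for a parent is that parent's nearest one.
    """
    nearest: Dict[str, PassageHit] = {}
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not passage_filter(hit):
            continue  # filtering happens at the passage (child) level
        if hit.parent_id not in nearest:
            nearest[hit.parent_id] = hit
            if len(nearest) == k:
                break
    return nearest


if __name__ == "__main__":
    for parent_id, hit in nearest_passage_per_parent(HITS, k=2).items():
        print(parent_id, "->", hit.text)
```

   Passing `passage_filter=lambda hit: hit.lang == "en"` skips the French 
passage, so the second parent becomes `doc-3` instead of `doc-2`.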
   
   > What confuses me is, I would have expected something like 
`ToParentBlockJoinQuery(parentBitSet, KnnVectorQuery())` to more or less work 
already.
   
   The main issue is that it won't return the correct number of parent 
documents when the user requests the top-k parents based on their child 
vectors. The inner kNN query selects the k nearest children first, and the 
join then collapses children that share a parent. If there are multiple 
children per parent, this approach may return fewer than k parent documents.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

