jimczi commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1563420102

   Thanks for sharing and working on a prototype @alessandrobenedetti !
   
   I have additional questions and comments ;)
   Starting with the devil advocate but I'd like to understand once more what 
is the use case? One possible approach today if the use case is passage 
retrieval is to index one document per passage/sentence. It works fine and you 
have access to the exact context of the match directly. What are the benefits 
of moving all passages to a single document? I am not saying there are none but 
they must be compelling if that requires all these changes.
   
   Regarding the change itself, I wonder if the approach could be more 
explicit. Instead of hiding the new feature I'd argue that we need to follow 
the doc values model where single and multi-valued are separated. In practice 
that would mean adding this to the KnnVectorsReader:
   ```
     public abstract MultiFloatVectorValues getMultiFloatVectorValues(String 
field) throws IOException;
     public abstract MultiByteVectorValues getMultiByteVectorValues(String 
field) throws IOException;
     public abstract TopDocs searchMulti(String field, float[] target, int k, 
Bits acceptDocs, int visitedLimit) throws IOException;
   ```
   The writer side is a bit more tricky but you see what I mean. It's a big 
change that impacts the implementation and expectations of many existing 
functions if the change is directly inserted in the existing knn field. 
   I know that it's easy to detect if a field is single or multi-valued under 
the hood so we could handle this transparently.
   We can do that with the explicit approach too. `searchMulti` can use 
`search` under the hood if we detect that the field contains exactly one vector 
per document. So my point is that we can try to share code as much as possible 
internally but we don't need to hide the difference externally.
   
   Another challenge for this change is the possible overflow in the graph 
representation. If we change the ordinal value to be the offset of the 
multivalued position, the maximum allowed is not per doc anymore but per 
vector. We use int32 in the graph to represents these ordinals, which matches 
with the max doc limit of Lucene. Maybe not a big deal for small scale but 
that's something to keep in mind when implementing things like merge or even 
multi-segments search.
   
   Anyway, quite exciting discussions, Ben and Mayya are reviewing too so just 
adding my 2cents and happy to help if/when we reach a consensus.
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to