jimczi commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1563420102
Thanks for sharing and working on a prototype @alessandrobenedetti ! I have additional questions and comments ;) Starting with the devil advocate but I'd like to understand once more what is the use case? One possible approach today if the use case is passage retrieval is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages to a single document? I am not saying there are none but they must be compelling if that requires all these changes. Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature I'd argue that we need to follow the doc values model where single and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader: ``` public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException; public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException; public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; ``` The writer side is a bit more tricky but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted in the existing knn field. I know that it's easy to detect if a field is single or multi-valued under the hood so we could handle this transparently. We can do that with the explicit approach too. `searchMulti` can use `search` under the hood if we detect that the field contains exactly one vector per document. So my point is that we can try to share code as much as possible internally but we don't need to hide the difference externally. Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multivalued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represents these ordinals, which matches with the max doc limit of Lucene. Maybe not a big deal for small scale but that's something to keep in mind when implementing things like merge or even multi-segments search. Anyway, quite exciting discussions, Ben and Mayya are reviewing too so just adding my 2cents and happy to help if/when we reach a consensus. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org