alessandrobenedetti commented on PR #12314: URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564165188
> Thanks for sharing and working on a prototype @alessandrobenedetti ! > > I have additional questions and comments ;) Starting with the devil advocate but I'd like to understand once more what is the use case? One possible approach today if the use case is passage retrieval is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages to a single document? I am not saying there are none but they must be compelling if that requires all these changes. Ale: I guess the use case for multi valued vector-fields is not much different from any other multi valued fields: you may want to avoid normalisation and the necessity to build additional complexity with multiple duplicated documents that only have the vector field as a difference. If you have simple documents with just the vector field, splitting them in passages and then aggregating them in the end of your search pipeline is not going to be that annoying, but imaging more complex situations where you have aggregations already, nested documents or just documents with many more metadata along. On top of that we got curiosity from some customers about the functionality, for this reason I felt it was a nice addition. > > Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature I'd argue that we need to follow the doc values model where single and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader: > > ``` > public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException; > public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException; > public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException; > ``` > > The writer side is a bit more tricky but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted in the existing knn field. I know that it's easy to detect if a field is single or multi-valued under the hood so we could handle this transparently. We can do that with the explicit approach too. `searchMulti` can use `search` under the hood if we detect that the field contains exactly one vector per document. So my point is that we can try to share code as much as possible internally but we don't need to hide the difference externally. That was the initial approach, it was explicit at index and query time, and they were separate code paths from the single valued use case. So it was not affecting the single valued scenario at all. Following some of the feedback here I moved to transparent approach, that prooved extremely beneficial in reducing the complexity of the pull request (at the cost of impacting potentially some single valued case scenario). I iterated a bit, so I should have reduced the impact already on the single valued case, but not to 100% I guess. > > Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multivalued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represents these ordinals, which matches with the max doc limit of Lucene. Maybe not a big deal for small scale but that's something to keep in mind when implementing things like merge or even multi-segments search. Yes, with one vector per document, the maximum amount of supported vectors was in line with the docs, this is still the case but I agree that right now we potentially support less docs, happy to change that. > > Anyway, quite exciting discussions, Ben and Mayya are reviewing too so just adding my 2cents and happy to help if/when we reach a consensus. Thanks for the feedback! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org