alessandrobenedetti commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1564165188

   > Thanks for sharing and working on a prototype @alessandrobenedetti !
   > 
   > I have additional questions and comments ;) Starting with the devil 
advocate but I'd like to understand once more what is the use case? One 
possible approach today if the use case is passage retrieval is to index one 
document per passage/sentence. It works fine and you have access to the exact 
context of the match directly. What are the benefits of moving all passages to 
a single document? I am not saying there are none but they must be compelling 
if that requires all these changes.
   
   Ale: I guess the use case for multi valued vector-fields is not much 
different from any other multi valued fields:
   you may want to avoid normalisation and the necessity to build additional 
complexity with multiple duplicated documents that only have the vector field 
as a difference.
   If you have simple documents with just the vector field, splitting them in 
passages and then aggregating them in the end of your search pipeline is not 
going to be that annoying, but imaging more complex situations where you have 
aggregations already, nested documents or just documents with many more 
metadata along.
   On top of that we got curiosity from some customers about the functionality, 
for this reason I felt it was a nice addition.
   
   > 
   > Regarding the change itself, I wonder if the approach could be more 
explicit. Instead of hiding the new feature I'd argue that we need to follow 
the doc values model where single and multi-valued are separated. In practice 
that would mean adding this to the KnnVectorsReader:
   > 
   > ```
   >   public abstract MultiFloatVectorValues getMultiFloatVectorValues(String 
field) throws IOException;
   >   public abstract MultiByteVectorValues getMultiByteVectorValues(String 
field) throws IOException;
   >   public abstract TopDocs searchMulti(String field, float[] target, int k, 
Bits acceptDocs, int visitedLimit) throws IOException;
   > ```
   > 
   > The writer side is a bit more tricky but you see what I mean. It's a big 
change that impacts the implementation and expectations of many existing 
functions if the change is directly inserted in the existing knn field. I know 
that it's easy to detect if a field is single or multi-valued under the hood so 
we could handle this transparently. We can do that with the explicit approach 
too. `searchMulti` can use `search` under the hood if we detect that the field 
contains exactly one vector per document. So my point is that we can try to 
share code as much as possible internally but we don't need to hide the 
difference externally.
   
   That was the initial approach, it was explicit at index and query time, and 
they were separate code paths from the single valued use case. So it was not 
affecting the single valued scenario at all.
   Following some of the feedback here I moved to transparent approach, that 
prooved extremely beneficial in reducing the complexity of the pull request (at 
the cost of impacting potentially some single valued case scenario).
   I iterated a bit, so I should have reduced the impact already on the single 
valued case, but not to 100% I guess.
   
   > 
   > Another challenge for this change is the possible overflow in the graph 
representation. If we change the ordinal value to be the offset of the 
multivalued position, the maximum allowed is not per doc anymore but per 
vector. We use int32 in the graph to represents these ordinals, which matches 
with the max doc limit of Lucene. Maybe not a big deal for small scale but 
that's something to keep in mind when implementing things like merge or even 
multi-segments search.
   
   Yes, with one vector per document, the maximum amount of supported vectors 
was in line with the docs, this is still the case but I agree that right now we 
potentially support less docs, happy to change that.
   > 
   > Anyway, quite exciting discussions, Ben and Mayya are reviewing too so 
just adding my 2cents and happy to help if/when we reach a consensus.
   
   Thanks for the feedback!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to