mikemccand commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4329105604

   Thank you for the "you are here" summary comment above @kaivalnp!  It helps 
me set all these proposals in a broader context.
   
   > I still see some challenges here, which is why I'd like to throw in the 
approach of Lucene de-duplicating vectors for the user back into the mix :)
   
   +1, if this path can work out I think this is quite clean.  (Thank you 
genai!  It's amazing how suddenly we see whole compelling codec formats 
appearing from genai... we humans, well this human, cannot keep up.)  The user 
understands fields, and can index vectors (dups or not) into separate fields.  
We are not spooky-coupling multiple fields; rather, one `KnnVectorsFormat` sees 
N fields and M documents and dedups across them.
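
   To make the "one format sees N fields and M documents" idea concrete, here 
is a rough sketch of content-based dedup across fields.  It is purely 
illustrative: `VectorDedupSketch` and its methods are hypothetical names, it 
does not use or model the real `KnnVectorsFormat` API or the approach in the 
PR, and it assumes "duplicate" means bitwise-identical float values.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not Lucene API. */
class VectorDedupSketch {
  // Raw vector bytes -> ordinal of the single stored copy.
  private final Map<ByteBuffer, Integer> ordByContent = new HashMap<>();
  // Each unique vector is stored exactly once, indexed by ordinal.
  private final List<float[]> uniqueVectors = new ArrayList<>();
  // field -> (docId -> ordinal); many (field, doc) pairs may share an ordinal.
  private final Map<String, Map<Integer, Integer>> ordByFieldAndDoc = new HashMap<>();

  /** Records a vector for (field, docId) and returns the shared ordinal. */
  int add(String field, int docId, float[] vector) {
    // Assumption: duplicates are detected by exact bitwise equality.
    ByteBuffer key = ByteBuffer.allocate(vector.length * Float.BYTES);
    key.asFloatBuffer().put(vector); // writes via a view; key's position stays 0
    Integer ord = ordByContent.get(key);
    if (ord == null) {
      ord = uniqueVectors.size();
      uniqueVectors.add(vector.clone());
      ordByContent.put(key, ord);
    }
    ordByFieldAndDoc.computeIfAbsent(field, f -> new HashMap<>()).put(docId, ord);
    return ord;
  }

  /** Number of distinct vectors actually stored. */
  int uniqueVectorCount() {
    return uniqueVectors.size();
  }

  /** Returns the vector for (field, docId), or null if none was indexed. */
  float[] vector(String field, int docId) {
    Map<Integer, Integer> byDoc = ordByFieldAndDoc.get(field);
    Integer ord = (byDoc == null) ? null : byDoc.get(docId);
    return (ord == null) ? null : uniqueVectors.get(ord);
  }
}
```

   In this sketch the user still addresses vectors by field and doc, and the 
sharing happens underneath.  The extra hashing per vector is a small-scale 
stand-in for the kind of added indexing cost a real implementation would have 
to pay.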
   
   And searching by field is already a well supported / understood path.  I 
guess the question is how much indexing cost this "magic" deduping adds 
(merging is quite a bit slower right now in 
https://github.com/apache/lucene/pull/15979?).  It's sort of like the magical 
deduping ZFS does -- whole filesystem blocks are dedup'd!
   
   Multiple labels for one vector would "just work" (multiple fields hold the 
same vector).  The user could still index an additional doc values field if 
they also want to sort the index by that label, or sometimes filter by that 
label for non-vector search...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

