mikemccand commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4290267510

   I think a user-visible/explicit "one field referencing/depending on another 
field" is a sort of anti-pattern in Lucene?  Fields are supposed to be 
independent of one another: the order in which IW (`IndexWriter`) processes 
fields in a doc is undefined, when a field is flushed IW can free up its 
buffered RAM regardless of whether other fields are flushed, Codec read/write 
APIs are geared around one field at a time, etc.
   
   If we can help it, I'd rather not intentionally introduce a user-visible 
"field A depends on field B" linkage here... it breaks with Lucene's convention 
/ normal approach to fields, and also makes the implementation more complex.
   
   A simple (optional) `String[] labels` on `KnnVectorField` would keep all 
this filter logic associated with just that one field.  Codec is then free to 
take different approaches, maybe a single graph where the transitions might 
have labels, or separate graph per label, or `IVFFlat` variant, or ...
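
   To make the "separate graph per label" option concrete, here is a minimal, 
self-contained sketch.  Everything in it is hypothetical: `LabeledVector` and 
`PerLabelIndex` are illustration-only names, not Lucene classes, and an exact 
scan over each label's bucket stands in for whatever per-label structure (HNSW 
graph, IVF lists, ...) a real Codec might build.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a vector field carrying optional labels; NOT a Lucene class. */
final class LabeledVector {
  final int docId;
  final float[] vector;
  final String[] labels;

  LabeledVector(int docId, float[] vector, String... labels) {
    this.docId = docId;
    this.vector = vector;
    this.labels = labels;
  }
}

/** Toy per-label index: one bucket per label, searched by exact scan (assumption: no graph). */
final class PerLabelIndex {
  private final Map<String, List<LabeledVector>> byLabel = new HashMap<>();

  /** Adds the vector to the bucket of every label it carries. */
  void add(LabeledVector v) {
    for (String label : v.labels) {
      byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(v);
    }
  }

  /** Returns up to k doc ids carrying the label, nearest (squared Euclidean) first. */
  List<Integer> search(String label, float[] query, int k) {
    List<Integer> hits = new ArrayList<>();
    List<LabeledVector> bucket = byLabel.get(label);
    if (bucket == null) {
      return hits; // no vector was indexed with this label
    }
    List<LabeledVector> candidates = new ArrayList<>(bucket);
    candidates.sort(
        Comparator.comparingDouble((LabeledVector v) -> squaredDistance(v.vector, query)));
    for (LabeledVector v : candidates.subList(0, Math.min(k, candidates.size()))) {
      hits.add(v.docId);
    }
    return hits;
  }

  private static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

   The point of the sketch is only that the filter lives with the one vector 
field: a search for `"candy"` never visits vectors outside that label, so a 
restrictive label cannot starve the result set the way a post-hoc filter can.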
   
   I can see the appeal of the cross-field approach, though.  It's likely those 
label(s) are already being indexed into a field on the doc (maybe a `category` 
field for groceries, say).  It's also somewhat like a relational DB where a 
table has N columns that are independent, but you can separately `create index 
on ...` for multiple columns (this feature would be a DB table index on a 
vector column and the `category` column for example).
   
   It might also generalize beyond a simple string label to any search-time 
`Filter` -- maybe I want to optimize vector search specifically for the 
`category = "candy" and price < $5` or so.  Not sure what THAT api would look 
like.  Maybe we'd need some fast/limited/approximate "does `Document` match 
`Query`" or so (oh, that's `MemoryIndex` maybe) -- let's leave that for phase 2 
or 7 ;)
   
   Anyways, I hope we can somehow start iterating on something here, a simple 
starting / MVP / dirt path.  We're kinda stuck in analysis paralysis since 
there are so many ways this could go :)  And this problem (poor filtered vector 
recall for restrictive filters) is a crucial limitation of Lucene's KNN vector 
search today.  Lucene's [radius vector 
query](https://github.com/apache/lucene/blob/34c9495241dca193d24adda98ff12623bd43c2de/lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityQuery.java)
 is immune to this problem (I think?), and also fits more correctly (than KNN) 
as a Lucene Query, but at possibly high CPU cost.  Dedicated vector search 
engines like `turbopuffer` [describe how they solve the filtered vector recall 
problem](https://turbopuffer.com/blog/native-filtering).
   
   @kaivalnp's original proposal (Lucene/Codec dedups multiple vector fields 
within one document) is appealing (no user-visible field linking, no 
user-specified labels, just ordinary vector fields), but maybe more complex to 
implement than explicit user labels or bitsets?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

