mikemccand commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-4290267510
I think a user visible/explicit "one field referencing/depending on another field" is a sort of anti-pattern in Lucene? Fields are supposed to be independent from one another. The order in which IW processes fields in a doc is undefined, and when a field is flushed IW can free up the buffered RAM regardless of whether other fields are flushed, Codec read/write APIs are geared around one field at a time, etc. If we can help it, I'd rather not intentionally introduce a user visible "field A depends on field B" linkage here... it breaks with Lucene's convention / normal approach to fields, and also makes the implementation more complex.

A simple (optional) `String[] labels` on `KnnVectorField` would keep all this filter logic associated with just that one field. Codec is then free to take different approaches, maybe a single graph where the transitions might have labels, or separate graph per label, or `IVFFlat` variant, or ...

I can see the appeal of the cross-field approach, though. It's likely those label(s) are already being indexed into a field on the doc (maybe a `category` field for groceries, say). It's also somewhat like a relational DB where a table has N columns that are independent, but you can separately `create index on ...` for multiple columns (this feature would be a DB table index on a vector column and the `category` column for example).

It might also generalize beyond a simple string label to any search-time `Filter` -- maybe I want to optimize vector search specifically for the `category = "candy" and price < $5` or so. Not sure what THAT api would look like. Maybe we'd need some fast/limited/approximate "does `Document` match `Query`" or so (oh, that's `MemoryIndex` maybe) -- let's leave that for phase 2 or 7 ;)

Anyways, I hope we can somehow start iterating on something here, a simple starting / MVP / dirt path.
We're kinda stuck in analysis paralysis since there are so many ways this could go :) And this problem (poor filtered vector recall for restrictive filters) is a crucial limitation of Lucene's KNN vector search today (Lucene's [radius vector query](https://github.com/apache/lucene/blob/34c9495241dca193d24adda98ff12623bd43c2de/lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityQuery.java) is immune to this problem (I think?), and also fits more correctly (than KNN) as a Lucene Query, but at possibly high CPU cost...).

Dedicated vector search engines like [`turbopuffer` describe how they solve the filtered vector recall problem](https://turbopuffer.com/blog/native-filtering).

@kaivalnp's original proposal (Lucene/Codec dedups multiple vector fields within one document) is appealing (no user-visible field linking, no user-specific labels to specify, just ordinary vector fields), but maybe more complex to implement than explicit user labels or bitsets?
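
To make the single-field `String[] labels` idea a bit more concrete, here is a rough sketch in plain Java of the "separate graph per label" option mentioned above. Everything in it is hypothetical: `LabeledVectorSketch` and `LabeledVector` are not Lucene classes, there is no Codec or `IndexWriter` involved, and each per-label partition is searched by brute force rather than with an HNSW graph. It only illustrates that, when the labels travel with the vector field itself, a label-restricted search can be answered from that field's own structures without consulting any other field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch only: none of these types exist in Lucene.
final class LabeledVectorSketch {

  // Stand-in for a vector field carrying the proposed optional String[] labels.
  static final class LabeledVector {
    final int docId;
    final float[] vector;
    final String[] labels;

    LabeledVector(int docId, float[] vector, String[] labels) {
      this.docId = docId;
      this.vector = vector;
      this.labels = labels;
    }
  }

  // One partition per label, mimicking the "separate graph per label" option.
  private final Map<String, List<LabeledVector>> byLabel = new HashMap<>();

  // Registers the vector under every label it carries.
  void add(LabeledVector v) {
    for (String label : v.labels) {
      byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(v);
    }
  }

  // Exact (brute-force) top-k doc ids by squared Euclidean distance within one label.
  List<Integer> search(String label, float[] query, int k) {
    List<LabeledVector> candidates = byLabel.getOrDefault(label, Collections.emptyList());
    return candidates.stream()
        .sorted(Comparator.comparingDouble((LabeledVector v) -> squaredDistance(v.vector, query)))
        .limit(k)
        .map(v -> v.docId)
        .collect(Collectors.toList());
  }

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  public static void main(String[] args) {
    LabeledVectorSketch index = new LabeledVectorSketch();
    index.add(new LabeledVector(0, new float[] {0f, 0f}, new String[] {"candy"}));
    index.add(new LabeledVector(1, new float[] {1f, 1f}, new String[] {"candy", "snacks"}));
    index.add(new LabeledVector(2, new float[] {0.1f, 0.1f}, new String[] {"produce"}));

    // Doc 2 is closer to the query than doc 1 but is never visited for "candy".
    System.out.println(index.search("candy", new float[] {0f, 0f}, 2)); // prints [0, 1]
  }
}
```

The trade-off in this toy version is the obvious one: a vector with several labels is stored once per label, which is the kind of duplication a real Codec would want to avoid (for example with a single graph whose transitions carry labels).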
