Re: [I] Support multiple HNSW graphs backed by the same vectors [lucene]

via GitHub Tue, 21 Apr 2026 09:43:39 -0700


mikemccand commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4290267510

I think a user visible/explicit "one field referencing/depending on another
field" is a sort of anti-pattern in Lucene? Fields are supposed to be
independent from one another. The order in which IW (`IndexWriter`) processes
fields in a doc is undefined, when a field is flushed IW can free up the
buffered RAM regardless of whether other fields are flushed, Codec read/write
APIs are geared around one field at a time, etc.

If we can help it, I'd rather not intentionally introduce a user visible
"field A depends on field B" linkage here... it breaks with Lucene's convention
/ normal approach to fields, and also makes the implementation more complex.

A simple (optional) `String[] labels` on `KnnVectorField` would keep all
this filter logic associated with just that one field. Codec is then free to
take different approaches, maybe a single graph where the transitions might
have labels, or separate graph per label, or `IVFFlat` variant, or ...
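
To make the labels idea concrete, here is a minimal sketch in plain Java. It
is an illustration under stated assumptions, not a Lucene API: the
`LabeledVector` record and the `partitionByLabel` helper are hypothetical names
invented for this example. It shows only the "separate graph per label" option,
where a codec could bucket doc IDs by label and build one graph per bucket
using nothing but the data carried by that single field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these types exist in Lucene.
public class LabeledVectorSketch {

  /** One vector value plus its optional labels, all carried by a single field. */
  record LabeledVector(int docId, float[] vector, String[] labels) {}

  /**
   * One possible codec-side strategy: bucket doc IDs per label, so that each
   * bucket could later get its own graph. Only this field's data is needed.
   */
  static Map<String, List<Integer>> partitionByLabel(List<LabeledVector> docs) {
    Map<String, List<Integer>> byLabel = new TreeMap<>();
    for (LabeledVector doc : docs) {
      for (String label : doc.labels()) {
        byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(doc.docId());
      }
    }
    return byLabel;
  }

  public static void main(String[] args) {
    List<LabeledVector> docs = List.of(
        new LabeledVector(0, new float[] {0.1f, 0.9f}, new String[] {"candy"}),
        new LabeledVector(1, new float[] {0.8f, 0.2f}, new String[] {"produce"}),
        new LabeledVector(2, new float[] {0.5f, 0.5f}, new String[] {"candy", "produce"}));
    // prints {candy=[0, 2], produce=[1, 2]}
    System.out.println(partitionByLabel(docs));
  }
}
```

A doc with several labels lands in several buckets, which is the storage
trade-off of the per-label approach; a single graph with labeled transitions
would avoid that duplication at the cost of a more complex traversal.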

I can see the appeal of the cross-field approach, though. It's likely those
label(s) are already being indexed into a field on the doc (maybe a `category`
field for groceries, say). It's also somewhat like a relational DB where a
table has N columns that are independent, but you can separately `create index
on ...` for multiple columns (this feature would be a DB table index on a
vector column and the `category` column for example).

It might also generalize beyond a simple string label to any search-time
`Filter` -- maybe I want to optimize vector search specifically for the
`category = "candy" and price < $5` or so. Not sure what THAT api would look
like. Maybe we'd need some fast/limited/approximate "does `Document` match
`Query`" or so (oh, that's `MemoryIndex` maybe) -- let's leave that for phase 2
or 7 ;)

Anyways, I hope we can somehow start iterating on something here, a simple
starting / MVP / dirt path. We're kinda stuck in analysis paralysis since
there are so many ways this could go :) And this problem (poor filtered vector
recall for restrictive filters) is a crucial limitation of Lucene's KNN vector
search today (Lucene's [radius vector
query](https://github.com/apache/lucene/blob/34c9495241dca193d24adda98ff12623bd43c2de/lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityQuery.java)
is immune to this problem (I think?), and also fits more correctly (than KNN)
as a Lucene Query, but at possibly high CPU cost...). Dedicated vector search
engines like [`turbopuffer` describe how they solve the filtered vector recall
problem](https://turbopuffer.com/blog/native-filtering).

@kaivalnp's original proposal (Lucene/Codec dedups multiple vector fields
within one document) is appealing (no user-visible field linking, no
user-specified labels, just ordinary vector fields), but maybe more
complex to implement than explicit user labels or bitsets?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

