Re: [I] Support multiple HNSW graphs backed by the same vectors [lucene]

via GitHub Thu, 23 Apr 2026 23:13:18 -0700


kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4311081631


   Thank you everyone for the suggestions here, very interesting!
   
   Having Lucene use explicit information about the graphs it has to create 
(either via `String[]` labels or referencing a `label` field) indeed has a lot 
of benefit -- Lucene does not need to "infer" that information on its own.
   
   I guess the main challenge is API-wise, about how to pass that information 
during indexing and search, without changing a lot of functions?
   
   The explicit `String[]` labels approach is what I had in mind 
[initially](https://github.com/apache/lucene/issues/14758#issue-3118912411), 
but as @jpountz pointed out 
[here](https://github.com/apache/lucene/issues/14758#issuecomment-2944618497), 
we'll need parallel APIs for most vector-related functions that also take label 
info, like:
   1. [`Knn*VectorField` 
constructor](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/document/KnnFloatVectorField.java#L90-L103)
 (add set of labels)
   2. 
[`KnnFieldVectorsWriter#addValue`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnFieldVectorsWriter.java#L33-L37)
 (pass set of labels)
   3. 
[`KnnVectorsReader#get*VectorValues`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L60-L65)
 (optional label to read)
   4. 
[`KnnVectorsReader#search`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L74-L101)
 (optional label to search)
   5. 
[`CodecReader#searchNearestVectors`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/index/CodecReader.java#L261-L274)
 (optional label to search)
   6. 
[`Knn*VectorQuery`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L81-L98)
 (optional label to search)
   
   ..among others, which sounds invasive to me :(
   
   The inter-field dependency suggestion relaxes some of these:
   1. For (1), the `label` field can be specified in the format itself, like 
@benwtrent suggested 
[here](https://github.com/apache/lucene/issues/14758#issuecomment-4253615422).
   2. For (2) the writer can read the `label` field contents during indexing 
and merge:
        1. During indexing, we still need to wait until flush to create the 
smaller HNSW graphs (because this is where the `label` field is guaranteed to 
be complete). We also need some way of getting and making a reader for the 
`label` field available in 
[`KnnVectorsWriter#flush`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L52-L53),
 but this sounds doable from what @iverase suggested 
[here](https://github.com/apache/lucene/issues/14758#issuecomment-4259019384).
        2. Accessing label information during merge also sounds trivial, by 
getting a view over the field across all segments in `mergeState`, which is 
[already 
available](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L85-L90).
   
   However, I don't see how we'll get around (3) (4) (5). Additionally, (6) is 
the tricky part:
   1. The `Knn*VectorQuery` [accepts an arbitrary `Query 
filter`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L77),
 which will need to be mapped to an appropriate label.
   2. While this could _potentially_ be as simple as mapping a `TermQuery` 
(having the same `label` field in the key) to the appropriate graph, the 
`label` field itself is format-specific, and unknown to the query?
   3. i.e. how would the query know whether to collect matching docs into a 
`BitSet` (for when a smaller graph is not available), OR pass the `TermQuery` 
filter as a "hint" to the format to choose a smaller HNSW graph internally? 
(again, this "hint" needs to be added to the `AcceptDocs` abstraction 
[here](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L100)).
   
   I still see some challenges here, which is why I'd like to throw in the 
approach of Lucene de-duplicating vectors for the user back into the mix :)
   
   This approach shifts responsibility of resolving the correct field + label 
information to the user. For example, if the above approaches add a `vector` 
field + labels `[1, 2, 3]` for a document -- the user now needs to add three 
top-level vector fields, (say) `vector_1`, `vector_2`, `vector_3` with the same 
vector, and Lucene will store a single copy of the vector on disk. At search 
time, the user also needs to know the correct field to search, based on filter 
information.
   
   Some other advantages of this approach I can think of:
   1. De-duplication is agnostic to labels, and one field does not have to be a 
pure subset of another to share backing vectors (e.g. if many, but not all 
vectors are shared).
   2. De-duplication is index-level, and vectors across documents and different 
fields can be shared too (e.g. if the vector was generated on a subset of 
fields of the document, like say `title`, then a single vector can be shared 
across all documents with the same `title`).
   
   Here, IMO the only real issue is the additional work at indexing to "infer" 
duplicates? (that is known to the user, and should ideally be used by Lucene, 
but is not easily doable due to challenges listed above)
   
   Using an AI, I was able to whip up an implementation for a de-duplicating 
vector format (see #15979) that needs minimal API changes (drop-in replacement 
for 
[`Lucene99FlatVectorsFormat`](https://github.com/apache/lucene/blob/db3222d4915a1e42071ea2b301637ac9691e947c/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsFormat.java#L74)).
   
   As @mikecan mentioned 
[here](https://github.com/apache/lucene/issues/14758#issuecomment-4290267510), 
maybe this is a start to having multiple HNSW graphs backed by the same raw 
vectors, albeit in an expensive way for indexing, until we converge on how to 
do it safely / cleanly?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Support multiple HNSW graphs backed by the same vectors [lucene]

Reply via email to