Re: [I] Support multiple HNSW graphs backed by the same vectors [lucene]

via GitHub Mon, 27 Apr 2026 09:54:57 -0700


mccullocht commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4328876425


   I think one advantage of going the DocValues route is that folks who want to 
partition the vector index this way likely already have an inverted index and 
possibly DocValues for the information they want to partition on. Not sure if 
this is enough of a reason to pursue this path.
   
   > Hmm what do you mean by grouping aspects? If we go with String[] labels 
sort of explicit user control (which we may not -- it looks like the 
automagically dedup path is promising! I have yet to catch up...), then 
SDV/SSDV are often good ways to store such things. But we shouldn't force this 
on the user? They might not need that forward index ("lookup label for 
docid=X"). They may only need these labels to specifically identify set of 
vectors to search, not as a "return field" or "sort by" etc.
   
   We're generating graph indexes for sets of vectors that match a label, and 
identifying which graphs we need to build has some overlap with SDV/SSDV IMO, 
at least more so than doing this with BDV. Agreed that we don't really need the 
forward index, so there's opportunity there; I was just thinking about avoiding 
more custom data structure serialization.
   
   > I think all of these ideas come down to adding a deref (int or long 
ordinal) for vectors (which anyways already exists if not every doc has a 
vector? and maybe if a vector field has multiple vectors (hmm does Lucene have 
multi-valued vector fields yet? i think not? but it was discussed? it would 
mean/require long not int ordinal?))
   
   I don't think there are any *indexed* multi-valued vector fields, although 
IIRC you can put multiple vectors together in DocValues and use them as a 
source for rescoring. Users can still get this kind of behavior with nesting, 
which simplifies the implementation a little bit.
   
   > [I still have concerns with the whole BytesRef label approach ... at query 
time would we accept disjunction BytesRef[] labels? Would we accept 
conjunction? Feels like we are beginning to re-implement Lucene's whole Query 
path ... but we gotta start somewhere, so I think index-time BytesRef[] labels 
is a good PnP (progress not perfection) first cut. And, one can do (ish) 
arbitrary Query on top of a labels sort of solution...]
   
   I might build the API to leave room for making this a label query in the 
future, but initially I would support only a single label at search time, for 
a couple of reasons:
   * I don't think we can take advantage of searching multiple graphs 
simultaneously.
   * Users can run the query multiple times or impose a filter on top of a 
graph that contains the union of the input vectors they want.
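
   As a rough sketch of the "run the query multiple times" option: the caller 
searches one label at a time and merges the per-label top-k lists, keeping the 
best score for any doc that shows up under more than one label. Everything 
named here (`MultiLabelSearch`, `Hit`, `searchAnyLabel`, and the 
`singleLabelSearch` callback) is hypothetical and stands in for whatever 
single-label search the API ends up exposing; none of it is an existing Lucene 
API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch: none of these types are Lucene APIs.
final class MultiLabelSearch {

  /** A scored hit; a higher score means a better match. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /**
   * Runs a single-label search once per label and merges the results into one
   * top-k list. singleLabelSearch takes (label, k) and returns up to k hits
   * from that label's graph.
   */
  static List<Hit> searchAnyLabel(
      List<String> labels,
      int k,
      BiFunction<String, Integer, List<Hit>> singleLabelSearch) {
    // A doc may carry several labels, so keep only its best score across graphs.
    Map<Integer, Hit> bestByDoc = new HashMap<>();
    for (String label : labels) {
      for (Hit hit : singleLabelSearch.apply(label, k)) {
        bestByDoc.merge(hit.doc, hit, (a, b) -> a.score >= b.score ? a : b);
      }
    }
    List<Hit> merged = new ArrayList<>(bestByDoc.values());
    // Best score first; break ties by doc id so the order is deterministic.
    merged.sort(
        Comparator.comparingDouble((Hit h) -> -h.score).thenComparingInt(h -> h.doc));
    return merged.size() > k ? new ArrayList<>(merged.subList(0, k)) : merged;
  }
}
```

   This costs one graph traversal per label, which is the trade-off against 
building a graph over the union and filtering it.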


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

