mccullocht commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4328876425
I think one advantage of going the DocValues route is that folks who want to
partition the vector index this way likely already have an inverted index and
possibly DocValues for the information they want to partition on. Not sure if
this is enough of a reason to pursue this path.
> Hmm what do you mean by grouping aspects? If we go with String[] labels
> sort of explicit user control (which we may not -- it looks like the
> automagically dedup path is promising! I have yet to catch up...), then
> SDV/SSDV are often good ways to store such things. But we shouldn't force this
> on the user? They might not need that forward index ("lookup label for
> docid=X"). They may only need these labels to specifically identify set of
> vectors to search, not as a "return field" or "sort by" etc.
We're generating graph indexes for sets of vectors that match a label --
identifying which graphs we need to build has some overlap with SDV/SSDV IMO,
at least more so than doing this with BDV. Agreed that we don't really need the
forward index, so there's an opportunity there; I was just thinking about
avoiding more custom data structure serialization.
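To make the overlap concrete, here is a minimal sketch of the grouping step: given the labels attached to each doc's vector, collect the doc IDs per label so that each entry identifies one graph to build. The class and method names are hypothetical and this is plain Java, not a Lucene API; an SSDV-backed version would presumably iterate ordinals rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class LabelPartitions {
  /**
   * labelsPerDoc[docId] holds the labels attached to that doc's vector.
   * Returns label -> ascending doc IDs, i.e. one candidate graph per label.
   */
  static Map<String, List<Integer>> groupByLabel(String[][] labelsPerDoc) {
    // TreeMap keeps labels sorted, similar in spirit to a sorted term dictionary.
    Map<String, List<Integer>> partitions = new TreeMap<>();
    for (int docId = 0; docId < labelsPerDoc.length; docId++) {
      for (String label : labelsPerDoc[docId]) {
        List<Integer> docs = partitions.computeIfAbsent(label, k -> new ArrayList<>());
        // Skip a label repeated on the same doc so each doc appears once per graph.
        if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
          docs.add(docId);
        }
      }
    }
    return partitions;
  }
}
```

Note that this only needs the label-to-docs direction; nothing here requires the forward "label for docid=X" lookup.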
> I think all of these ideas come down to adding a deref (int or long
> ordinal) for vectors (which anyways already exists if not every doc has a
> vector? and maybe if a vector field has multiple vectors (hmm does Lucene have
> multi-valued vector fields yet? i think not? but it was discussed? it would
> mean/require long not int ordinal?))
I don't think there are any *indexed* multi-valued vector fields, although
IIRC you can put multiple vectors together in DocValues and use it as a source
for rescoring. Users can still get this kind of behavior with nesting, which
simplifies the implementation a little bit.
> [I still have concerns with the whole BytesRef label approach ... at query
> time would we accept disjunction BytesRef[] labels? Would we accept
> conjunction? Feels like we are beginning to re-implement Lucene's whole Query
> path ... but we gotta start somewhere, so I think index-time BytesRef[] labels
> is a good PnP (progress not perfection) first cut. And, one can do (ish)
> arbitrary Query on top of a labels sort of solution...]
I would probably build the API so that this could become a label query in the
future, but initially I think I would only support a single label at search
time, for a couple of different reasons:
* I don't think we can take advantage of searching multiple graphs
simultaneously.
* Users can run the query multiple times or impose a filter on top of a
graph that contains the union of the input vectors they want.
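As a rough sketch of the "run the query multiple times" option: a caller could search each label's graph separately and merge the per-label top-k lists, keeping the best score for any doc that appears under more than one label. `Hit` and `mergeTopK` are hypothetical names in plain Java, not Lucene classes, and the sketch assumes higher scores are better.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of Lucene. */
final class MultiLabelMerge {
  /** A single search hit: doc ID plus similarity score (higher is better). */
  static final class Hit {
    final int docId;
    final float score;

    Hit(int docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  /** Merges per-label result lists into one top-k list, deduplicating by doc ID. */
  static List<Hit> mergeTopK(List<List<Hit>> perLabelHits, int k) {
    // A doc tagged with several labels can be returned by several searches;
    // keep only its best-scoring occurrence.
    Map<Integer, Hit> best = new HashMap<>();
    for (List<Hit> hits : perLabelHits) {
      for (Hit hit : hits) {
        best.merge(hit.docId, hit, (a, b) -> a.score >= b.score ? a : b);
      }
    }
    List<Hit> merged = new ArrayList<>(best.values());
    // Sort by descending score, breaking ties by ascending doc ID.
    merged.sort((a, b) -> {
      int byScore = Float.compare(b.score, a.score);
      return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
    });
    return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
  }
}
```

The cost is one graph search per label, which is part of why a single label at search time seems like a reasonable first cut.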
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]