mikemccand commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-4327205371

   > > I think a user visible/explicit "one field referencing/depending on 
another field" is a sort of anti-pattern in Lucene? Fields are supposed to be 
independent from one another
   > 
   > I am going to challenge this assumption since we introduced index sorting.
   
   Index-time sorting is an impactful Lucene feature (and, I think, not used 
as often as it could/should be), but I don't see it as an example of the spooky 
field-to-field coupling within a `Document` during indexing that we generally 
try to avoid (fields are fully independent).
   
   It's more of a global setting: "sort the whole index by X".
   
   > In addition, if you want to build your index dependent on another value, 
it will be better that your data is sorted by that value otherwise it becomes 
tricky. I think this idea is very tightly related to index sorting.
   
   I don't really understand the connection between index sorting and 
index-time HNSW filters.  Maybe you are saying, if you have labels for your 
vectors (which might be multi-valued, but let's pretend single-valued) and you 
build a dedicated HNSW graph per label during indexing (this issue), that you 
should also then sort your index by these labels?  I guess for better locality 
of the vectors when using the HNSW graph for that label?  OK that makes sense, 
it should help.  But I think that's just an optimization -- if I want to sort 
the index by something else important, that's fine.
   
   > * `Sorted{,Set}DocValues` does feel like the right data structure because 
it handles a lot of the grouping aspects that are needed here.
   
   Hmm what do you mean by grouping aspects?  If we go with `String[] labels` 
sort of explicit user control (which we may not -- it looks like the 
automagic dedup path is promising!  I have yet to catch up...), then 
SDV/SSDV (`SortedDocValues`/`SortedSetDocValues`) are often good ways to store 
such things.  But we shouldn't force this on the user?  They might not need 
that forward index ("lookup label for docid=X").  They may only need these 
labels to identify the specific set of vectors to search, not as a "return 
field" or "sort by" etc.
   
   > * The KNN writer can consume DocValues with a bit of plumbing. In the 
ingestion path it forces us to wait until the segment is flushed to construct 
the graph. In the merge path it feels like we're duplicating at least some of 
the work required to merge DocValues, although I guess if we introduced some 
constraints on field ordering in merge we could use a merged view.
   
   Yeah so it's exactly these kinds of spooky "let's make IW's 
indexing/buffering hairier by coupling fields together"-ness that feels wrong 
as we go down the path of "field A depends on field B in Document".  Why not 
pass the optional `String[] labels` to the `*VectorField` only?  It sidesteps 
all these complexities?
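   
   A minimal sketch of that shape, in plain Java with no Lucene dependency: the labels travel with the vector itself, so the buffering side never has to consult a second field. `LabeledVectorBuffer`, `add`, and `ordinalsFor` are hypothetical names for illustration only, not existing Lucene APIs, and labels are modeled as `byte[]` (wrapped in `ByteBuffer` so they can serve as map keys) purely as an assumption.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory buffer for illustration; not a Lucene class.
final class LabeledVectorBuffer {
    private final List<float[]> vectors = new ArrayList<>();
    // label bytes -> ordinals of the vectors carrying that label, in arrival order
    private final Map<ByteBuffer, List<Integer>> ordsByLabel = new HashMap<>();

    /** Buffers one vector together with its optional labels and returns its ordinal. */
    int add(float[] vector, byte[]... labels) {
        int ord = vectors.size();
        vectors.add(vector.clone());
        for (byte[] label : labels) {
            // ByteBuffer compares by content, unlike a raw byte[] used as a map key.
            ordsByLabel
                .computeIfAbsent(ByteBuffer.wrap(label.clone()), k -> new ArrayList<>())
                .add(ord);
        }
        return ord;
    }

    /** Ordinals a per-label graph could be built over; empty if the label was never seen. */
    List<Integer> ordinalsFor(byte[] label) {
        List<Integer> ords = ordsByLabel.get(ByteBuffer.wrap(label));
        return ords == null
            ? Collections.<Integer>emptyList()
            : Collections.unmodifiableList(ords);
    }

    /** Number of buffered vectors. */
    int size() {
        return vectors.size();
    }
}
```

   Nothing here needs to wait for another field's values to arrive, which is the point of keeping the labels on the vector field.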
   
   > * In addition to a graph per doc value we also need a secondary map that 
either maps docids <-> vector ordinals, or maps global vector ordinals <-> 
local vector ordinals. This is probably true in any solution.
   
   +1
   
   I think all of these ideas come down to adding a deref (int or long ordinal) 
for vectors (which anyways already exists if not every doc has a vector?  and 
maybe if a vector field has multiple vectors (hmm does Lucene have multi-valued 
vector fields yet?  I think not?  but it was discussed?  it would mean/require 
`long` not `int` ordinal?))
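   
   For the simple single-valued case, the deref can be sketched as a sorted array of docids indexed by dense vector ordinal. This is a plain-Java illustration under the assumption that each doc has at most one vector; `VectorOrdinals` and its methods are hypothetical names, not Lucene classes.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not a Lucene class.
final class VectorOrdinals {
    // ordToDoc[ord] is the docid that holds vector `ord`; must be strictly ascending.
    private final int[] ordToDoc;

    VectorOrdinals(int[] ascendingDocIds) {
        this.ordToDoc = ascendingDocIds.clone();
    }

    /** Vector ordinal -> docid. */
    int docId(int ord) {
        return ordToDoc[ord];
    }

    /** Docid -> vector ordinal, or -1 when the doc has no vector. */
    int ordinal(int docId) {
        int idx = Arrays.binarySearch(ordToDoc, docId);
        return idx >= 0 ? idx : -1;
    }

    /** Number of vectors, which can be smaller than the number of docs. */
    int size() {
        return ordToDoc.length;
    }
}
```

   The same array shape could map a per-label graph's local ordinals to global vector ordinals. With several vectors per doc the docid lookup stops being one-to-one, so this sketch would not carry over unchanged.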
   
   > * The search interface would still need to take `Optional<BytesRef>` or 
similar to choose the right graph.
   
   +1
   
   And +1 for `BytesRef` labels not `String` ... what was I thinking (up above)!
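   
   A rough sketch of what "choose the right graph" could look like at search time, in plain Java. `LabeledGraphs` is a hypothetical name, the graph type is left generic, and `byte[]` stands in for `BytesRef`; falling back to a whole-field graph when no label is given is an assumption of this sketch.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical registry for illustration; G stands in for whatever a graph is.
final class LabeledGraphs<G> {
    private final G wholeFieldGraph;
    private final Map<ByteBuffer, G> perLabel = new HashMap<>();

    LabeledGraphs(G wholeFieldGraph) {
        this.wholeFieldGraph = wholeFieldGraph;
    }

    /** Registers the graph built over the vectors carrying this label. */
    void put(byte[] label, G graph) {
        // Copy the bytes so later mutation by the caller cannot corrupt the key.
        perLabel.put(ByteBuffer.wrap(label.clone()), graph);
    }

    /**
     * No label: the whole-field graph. Known label: its dedicated graph.
     * Unknown label: empty, so the caller can decide how to fall back.
     */
    Optional<G> select(Optional<byte[]> label) {
        if (!label.isPresent()) {
            return Optional.of(wholeFieldGraph);
        }
        return Optional.ofNullable(perLabel.get(ByteBuffer.wrap(label.get())));
    }
}
```

   Raw `byte[]` keys would compare by identity, which is why the sketch wraps them; `BytesRef` compares by content.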
   
   > Maybe the right solution is to compose it the other way -- extend the 
interface to accept labels at ingestion and query time and use DocValues 
underneath?
   
   +1 to simplifying the API for the user, if it decouples the two 
spooky-action-at-a-distance linked fields.
   
   But under-the-hood doc values would also introduce the same couplings / 
complexities / constraints in IW's inner (hairy!) classes?  And, it might be 
overkill (we don't need the forward-index that doc values impls work so hard to 
provide)?
   
   [I still have concerns with the whole `BytesRef label` approach ... at query 
time would we accept a disjunction (`BytesRef[] labels`)?  Would we accept a 
conjunction?  Feels like we are beginning to re-implement Lucene's whole 
`Query` path ... but we gotta start somewhere, so I think index-time 
`BytesRef[] labels` is a good PnP (progress not perfection) first cut.  And, 
one can run a (more or less) arbitrary `Query` on top of a labels-style 
solution...]
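   
   To make the disjunction question concrete, here is one naive way it might be served, offered only as an assumption and not something proposed in this thread: search each label's graph separately, then merge the per-label hits and keep each doc once. `LabelDisjunction`, `Hit`, and `mergeTopK` are hypothetical names in plain Java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical merge step for illustration; not a Lucene class.
final class LabelDisjunction {
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Unions per-label hit lists, dedups by docid, and returns the k best by score. */
    static List<Hit> mergeTopK(List<List<Hit>> perLabelHits, int k) {
        Map<Integer, Hit> best = new HashMap<>();
        for (List<Hit> hits : perLabelHits) {
            for (Hit h : hits) {
                // A doc carrying several labels can show up in several lists.
                best.merge(h.docId, h, (a, b) -> a.score >= b.score ? a : b);
            }
        }
        List<Hit> merged = new ArrayList<>(best.values());
        // Highest score first; ties broken by ascending docid for determinism.
        merged.sort((a, b) -> {
            int c = Float.compare(b.score, a.score);
            return c != 0 ? c : Integer.compare(a.docId, b.docId);
        });
        return merged.size() > k ? new ArrayList<>(merged.subList(0, k)) : merged;
    }
}
```

   A conjunction has no equally cheap merge, since a doc's absence from one label's top hits does not show that it lacks the label.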


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

