benwtrent commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3599161149

   I was thinking about this off/on over the past couple of weeks. 
   
   The de-duping and the multiple graphs are so complex :(. Not that this is a 
good enough reason to dismiss the change.
   
   
   This is primarily about filtered vector search, usually in multi-tenant 
scenarios, but it could also be time ranges, etc.
   
   Could we do something structurally to HNSW when the documents are sorted? I 
would suspect that in typical multi-tenant and highly filtered scenarios, the 
index is sorted (thus making ALL lexical searching faster).
   
   If the documents are sorted, this is usually a signal that, whatever the 
sorting criteria, the user cares about documents being near each other and 
will search in a way that respects the sorting criteria.
   
   We could force HNSW to add "sorted index connections", broadening the random 
jumps in the graph to enforce some locality connectivity threshold. Meaning, a 
given node must be connected to X nodes that are within a range of Y of it by 
document ID.
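
   A minimal sketch of that rule, assuming node ordinals follow the index sort 
order. The class `LocalityLinks`, the method `withLocalLinks`, and the 
parameters `minLocal` (X) and `window` (Y) are hypothetical names for 
illustration; nothing like this exists in Lucene today.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; not part of Lucene's HNSW implementation. */
final class LocalityLinks {
  private LocalityLinks() {}

  /**
   * Returns the neighbor list for {@code node}, extended so that at least
   * {@code minLocal} (X) neighbors lie within {@code window} (Y) doc IDs,
   * or as many as the window and graph bounds allow.
   * Assumes ordinals follow index sort order and range over [0, numNodes).
   */
  static List<Integer> withLocalLinks(
      int node, List<Integer> neighbors, int numNodes, int minLocal, int window) {
    List<Integer> result = new ArrayList<>(neighbors);
    Set<Integer> seen = new HashSet<>(neighbors);

    // Count the existing neighbors that already satisfy the locality rule.
    int local = 0;
    for (int n : neighbors) {
      if (n != node && Math.abs(n - node) <= window) {
        local++;
      }
    }

    // Walk outward from the node (right, then left), nearest doc IDs first,
    // adding links until the threshold is met or the window is exhausted.
    for (int d = 1; d <= window && local < minLocal; d++) {
      for (int candidate : new int[] {node + d, node - d}) {
        if (local >= minLocal) {
          break;
        }
        if (candidate < 0 || candidate >= numNodes) {
          continue;
        }
        if (seen.add(candidate)) {
          result.add(candidate);
          local++;
        }
      }
    }
    return result;
  }
}
```

   This picks the extra links purely by doc ID distance. A real version would 
probably choose among the candidates in the window by vector similarity, and 
would have to decide whether these links count against the usual max 
connections limit.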
   
   I realize this is a "hack" to the HNSW graph, but the key issue to me is not 
that we are applying filtered search, but that it requires many jumps to get to 
nodes that are near enough to the query. If we could add "shortcuts" within 
the graph, it might be enough, and it would be much more flexible than requiring 
a tenant-focused "green/blue/red" type of field.
   
   P.S. I think Qdrant does something like this, but I think they have the 
luxury of knowing the metadata fields and their values directly when building 
the graph.
   
   
   P.P.S. Currently, when we start making many jumps without scoring (aka 
ACORN), we spend silly amounts of CPU time just decoding the stupid graph. 
GroupVarInt helps some, but daggum, it's silly that we lose that much 
performance to the administrivia of the graph.
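
   For anyone unfamiliar with the encoding, here is a rough sketch of the 
general group-varint idea: four ints share one header byte holding four 2-bit 
lengths, so decoding needs no per-byte continuation check. The class name 
`GroupVarIntSketch` and the exact byte layout are assumptions for illustration 
and may not match what Lucene writes on disk.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative group-varint codec; layout is an assumption, not Lucene's format. */
final class GroupVarIntSketch {
  private GroupVarIntSketch() {}

  /** Number of bytes (1 to 4) needed to hold x, treating it as unsigned. */
  static int byteLength(int x) {
    if ((x & 0xFFFFFF00) == 0) return 1;
    if ((x & 0xFFFF0000) == 0) return 2;
    if ((x & 0xFF000000) == 0) return 3;
    return 4;
  }

  /** Encodes exactly four ints: one header byte, then little-endian payloads. */
  static byte[] encode4(int[] v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int header = 0;
    for (int i = 0; i < 4; i++) {
      // Two bits per value: (length - 1), value i stored at bit offset 2*i.
      header |= (byteLength(v[i]) - 1) << (i * 2);
    }
    out.write(header);
    for (int i = 0; i < 4; i++) {
      int n = byteLength(v[i]);
      for (int b = 0; b < n; b++) {
        out.write(v[i] >>> (8 * b)); // write(int) keeps only the low 8 bits
      }
    }
    return out.toByteArray();
  }

  /** Decodes the four ints written by encode4. */
  static int[] decode4(byte[] in) {
    int header = in[0] & 0xFF;
    int pos = 1;
    int[] v = new int[4];
    for (int i = 0; i < 4; i++) {
      int n = ((header >>> (i * 2)) & 3) + 1;
      int x = 0;
      for (int b = 0; b < n; b++) {
        x |= (in[pos++] & 0xFF) << (8 * b);
      }
      v[i] = x;
    }
    return v;
  }
}
```

   Even with the lengths known up front, every neighbor list still has to be 
decoded before any of its nodes can be checked against the filter, which is 
where the time goes when most hops are not scored.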


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

