benwtrent commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-3599161149
I was thinking about this off and on over the past couple of weeks. The de-duping and multiple graphs is so complex :(. Not that this is a good enough reason for dismissing the change.

This is primarily about filtered vector search, usually in multi-tenant scenarios, but it could also be time ranges, etc.

Could we do something structurally to HNSW when the documents are sorted? I would suspect that in typical multi-tenant and highly filtered scenarios, the index is sorted (thus making ALL lexical searching faster).

If the documents are sorted, this is usually a signal that, whatever the sorting criteria, the user cares about documents being near each other and will search respecting the sorting criteria.

We could force HNSW to add "sorted index connections", broadening the random jumps in the graph to enforce some locality connectivity threshold. Meaning, a given node must be connected to X nodes whose document IDs are within a range of Y of its own.

I realize this is a "hack" to the HNSW graph, but the key issue to me is not that we are applying filtered search, but that it requires many jumps to get to nodes that are near enough to the query. If we could add "shortcuts" within the graph, it might be enough, and it would be much more flexible than requiring tenant-focused "green/blue/red" type fields.

P.S. I think Qdrant does something like this, but I think they have the luxury of knowing the metadata fields and their values directly when building the graph.

P.P.S. Currently, when we start making many jumps without scoring, aka ACORN, we spend silly amounts of CPU time just decoding the stupid graph. GroupVarInt helps some, but daggum, it's silly that we lose that much performance to the administrivia of the graph.
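
To make the "sorted index connections" idea concrete, here is a minimal sketch in Java. It is not Lucene's API: `LocalityLinks`, `enforceLocality`, and the parameters `minLocal` (X) and `window` (Y) are hypothetical names, and it assumes node ordinals follow document ID order. It also picks the extra links purely by ID distance, whereas a real implementation would presumably choose the most similar vectors inside the window and respect the graph's max-connections limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class LocalityLinks {

  /**
   * Returns a copy of {@code neighbors} for {@code node}, extended so that at least
   * {@code minLocal} (X) neighbors have an ordinal within {@code window} (Y) of the node,
   * as far as the ordinal range [0, numNodes) allows.
   */
  static List<Integer> enforceLocality(
      int node, List<Integer> neighbors, int numNodes, int minLocal, int window) {
    List<Integer> result = new ArrayList<>(neighbors);

    // Count the existing neighbors that already satisfy the locality constraint.
    int local = 0;
    for (int n : neighbors) {
      if (Math.abs(n - node) <= window) {
        local++;
      }
    }

    // Walk outward from the node (node-1, node+1, node-2, node+2, ...) and add
    // "shortcut" links until the threshold is met or the window is exhausted.
    for (int d = 1; d <= window && local < minLocal; d++) {
      for (int cand : new int[] {node - d, node + d}) {
        if (local >= minLocal) {
          break;
        }
        if (cand < 0 || cand >= numNodes || result.contains(cand)) {
          continue; // out of range, or already linked
        }
        result.add(cand);
        local++;
      }
    }
    return result;
  }
}
```

With X = 3 and Y = 8, a node with ordinal 100 and neighbors [3, 950, 101] already has one local neighbor (101), so the sketch adds links to 99 and 98 and leaves the long-range links untouched.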
