benchaplin commented on issue #13940: URL: https://github.com/apache/lucene/issues/13940#issuecomment-2557564928
Hey @benwtrent - I played around with implementing this. I tried a bunch of things, benchmarking as I went. Here's what I found: - The performance improvements seem to come from the "predicate subgraph traversal" (aka only score/consider candidates that pass the filter). This saves time because it doesn't score nodes that don't pass the filter, and thus won't be collected as results. However, this approach risks missing filter-passing results 2+ hops away. - Two-hop neighbor expansion mitigates the above risk. I was getting decent results doing predicate subgraph traversal + two-hop expansion. - [Weaviate](https://weaviate.io/blog/speed-up-filtered-vector-search) "conditionally evaluates whether to use the two-hop expansion" - I tried this and saw lower latencies for inclusive filters, so I added it as well. Here's a draft PR with the best benchmarking code I found: https://github.com/apache/lucene/pull/14085. I benchmarked using Cohere embeddings with params: - nDoc = 200000 - topK = 100 - fanout = 50 - maxConn = 32 - beamWidth = 100 - filterSelectivity = [0.05, 0.25, 0.5, 0.75, 0.95] Here are some results: **Baseline:** | filterSelectivity | recall | latency (ms) | | --------------- | ------ | ------------ | | 0.05 | 0.037 | 17.182 | | 0.25 | 0.166 | 7.348 | | 0.5 | 0.332 | 4.376 | | 0.75 | 0.489 | 3.165 | | 0.95 | 0.608 | 2.441 | **Candidate** (this code): | filterSelectivity | recall | latency (ms) | | --------------- | ------ | ------------ | | 0.05 | 0.028 | 2.744 | | 0.25 | 0.157 | 4.614 | | 0.5 | 0.308 | 4.833 | | 0.75 | 0.449 | 4.622 | | 0.95 | 0.563 | 3.244 | --- Going forward, I think one thing that must be tested is correlation between the filter and the query vector. The paper discusses this at length. The risks lie in a negative correlation. To your point on multiple entry points to layer 0, the Weaviate article mentions using that approach to mitigate low correlations, so definitely something worth trying as well. I'm going to start by adding a correlation knob to the KNN testing suite in luceneutil. I figured I'd share this in the meantime in case you had any ideas. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org