benchaplin commented on issue #13940:
URL: https://github.com/apache/lucene/issues/13940#issuecomment-2557564928

   Hey @benwtrent - I played around with implementing this.
   
   I tried a bunch of things, benchmarking as I went. Here's what I found:
   - The performance improvements seem to come from the "predicate subgraph 
traversal" (i.e., only score/consider candidates that pass the filter). This 
saves time by skipping the scoring of nodes that fail the filter and therefore 
can never be collected as results. However, this approach risks missing 
filter-passing results that are 2+ hops away.
   - Two-hop neighbor expansion mitigates the above risk. I was getting decent 
results doing predicate subgraph traversal + two-hop expansion.
   - [Weaviate](https://weaviate.io/blog/speed-up-filtered-vector-search) 
"conditionally evaluates whether to use the two-hop expansion" - I tried this 
and saw lower latencies for more inclusive (higher-selectivity) filters, so I 
added it as well. A rough sketch of the combined traversal is below this list.
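
   To make the idea concrete, here's a self-contained sketch (not the PR code) of best-first search over a single graph layer that only scores filter-passing nodes, plus a conditional two-hop expansion through filtered-out neighbors. The adjacency-list graph, the scorer, and the expansion threshold are simplifications/assumptions made just for illustration.

```java
import java.util.BitSet;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.function.IntToDoubleFunction;

/**
 * Illustrative sketch (not the PR code): best-first search over a single graph
 * layer that only scores filter-passing nodes ("predicate subgraph traversal"),
 * with a conditional two-hop expansion through filtered-out neighbors so
 * filter-passing nodes 2 hops away can still be reached. The adjacency-list
 * graph, scorer, and threshold are assumptions made for the example.
 */
public class FilteredGraphSearchSketch {

  record Scored(int node, double score) {}

  /** Expand two hops only when at most this fraction of direct neighbors passes the filter. */
  static final double TWO_HOP_THRESHOLD = 0.5;

  static int[] search(
      int entryPoint,
      int topK,
      int[][] neighbors,            // adjacency lists for the layer being searched
      BitSet accepted,              // the filter: which nodes may be returned
      IntToDoubleFunction scorer) { // similarity of a node's vector to the query

    BitSet visited = new BitSet(neighbors.length);
    PriorityQueue<Scored> candidates =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::score).reversed()); // max-heap
    PriorityQueue<Scored> results =
        new PriorityQueue<>(Comparator.comparingDouble(Scored::score)); // min-heap of the best topK

    // Seed the search from the entry point (scored even if it fails the filter,
    // so the traversal has somewhere to start).
    visited.set(entryPoint);
    double entryScore = scorer.applyAsDouble(entryPoint);
    candidates.add(new Scored(entryPoint, entryScore));
    if (accepted.get(entryPoint)) {
      results.add(new Scored(entryPoint, entryScore));
    }

    while (!candidates.isEmpty()) {
      Scored current = candidates.poll();
      if (results.size() >= topK && current.score() < results.peek().score()) {
        break; // no remaining candidate can improve the result set
      }
      int[] oneHop = neighbors[current.node()];
      int passing = 0;
      for (int n : oneHop) {
        if (accepted.get(n)) {
          passing++;
          visit(n, visited, candidates, results, topK, scorer);
        }
      }
      // Conditional two-hop expansion: only pay the extra cost when the filter
      // removed most of this node's direct neighbors.
      if (oneHop.length > 0 && (double) passing / oneHop.length <= TWO_HOP_THRESHOLD) {
        for (int n : oneHop) {
          if (accepted.get(n)) {
            continue; // expand through filtered-out neighbors only
          }
          for (int nn : neighbors[n]) {
            if (accepted.get(nn)) {
              visit(nn, visited, candidates, results, topK, scorer);
            }
          }
        }
      }
    }
    // Heap order, not sorted by score; callers can sort if needed.
    return results.stream().mapToInt(Scored::node).toArray();
  }

  /** Score a filter-passing node once and track it as both a candidate and a result. */
  private static void visit(
      int node,
      BitSet visited,
      PriorityQueue<Scored> candidates,
      PriorityQueue<Scored> results,
      int topK,
      IntToDoubleFunction scorer) {
    if (visited.get(node)) {
      return;
    }
    visited.set(node);
    Scored scored = new Scored(node, scorer.applyAsDouble(node));
    candidates.add(scored);
    results.add(scored);
    if (results.size() > topK) {
      results.poll(); // drop the current worst result
    }
  }
}
```

   The real implementation differs in the details (Lucene's `HnswGraphSearcher`, scorers, and collectors), but this is the shape of the traversal being benchmarked.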
   
   Here's a draft PR with the best-performing version from my benchmarking: 
https://github.com/apache/lucene/pull/14085.
   
   I benchmarked using Cohere embeddings with params:
   - nDoc = 200000
   - topK = 100
   - fanout = 50
   - maxConn = 32
   - beamWidth = 100
   - filterSelectivity = [0.05, 0.25, 0.5, 0.75, 0.95]
   
   Here are some results:
   
   **Baseline:**
   | filterSelectivity | recall | latency (ms) |
   | --------------- | ------ | ------------ |
   | 0.05 | 0.037  | 17.182       |
   | 0.25 | 0.166  | 7.348        |
   | 0.5 | 0.332  | 4.376        |
   | 0.75 | 0.489  | 3.165        |
   | 0.95 | 0.608  | 2.441        |
   
   **Candidate** (this code): 
   | filterSelectivity | recall | latency (ms) |
   | --------------- | ------ | ------------ |
   | 0.05 | 0.028  | 2.744        |
   | 0.25 | 0.157  | 4.614        |
   | 0.5 | 0.308  | 4.833        |
   | 0.75 | 0.449  | 4.622        |
   | 0.95 | 0.563  | 3.244        |
   
   ---
   
   Going forward, I think one thing that must be tested is the correlation 
between the filter and the query vector. The paper discusses this at length; 
the risk lies in negative correlation, where filter-passing docs tend to be far 
from the query vector. To your point on multiple entry points to layer 0, the 
Weaviate article mentions using that approach to mitigate low correlations, so 
it's definitely worth trying as well.
   
   I'm going to start by adding a correlation knob to the KNN testing suite in 
luceneutil. I figured I'd share this in the meantime in case you had any ideas.
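
   For reference, here's a rough sketch of what such a knob could look like (an illustration only, not luceneutil's actual API): given each doc's similarity to the query, build a filter with a fixed selectivity whose overlap with the query's nearest neighbors is controlled by a `correlation` parameter in [-1, 1].

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

/**
 * Hedged sketch of a possible "correlation knob" for a filtered-KNN benchmark
 * (not luceneutil's actual API). Builds a filter with the requested
 * selectivity whose overlap with the query's nearest neighbors is controlled
 * by `correlation` in [-1, 1]:
 *   +1 -> the filter prefers the docs most similar to the query,
 *    0 -> the filter is independent of the query,
 *   -1 -> the filter prefers the docs least similar to the query.
 */
public class CorrelatedFilterSketch {

  static BitSet buildFilter(float[] queryScores, double selectivity, double correlation, long seed) {
    int n = queryScores.length;
    int targetSize = (int) Math.round(selectivity * n);

    // Rank docs by similarity to the query, descending.
    Integer[] byScore = new Integer[n];
    for (int i = 0; i < n; i++) byScore[i] = i;
    Arrays.sort(byScore, (a, b) -> Float.compare(queryScores[b], queryScores[a]));

    Random random = new Random(seed);
    BitSet filter = new BitSet(n);
    for (int rank = 0; rank < n; rank++) {
      int doc = byScore[rank];
      boolean pass;
      if (random.nextDouble() < Math.abs(correlation)) {
        // Decide by rank: the most similar docs pass for positive correlation,
        // the least similar docs pass for negative correlation.
        pass = correlation >= 0 ? rank < targetSize : rank >= n - targetSize;
      } else {
        // Decide independently of the query, preserving the expected selectivity.
        pass = random.nextDouble() < selectivity;
      }
      if (pass) {
        filter.set(doc);
      }
    }
    return filter;
  }
}
```

   Sweeping `correlation` alongside `filterSelectivity` should expose the negative-correlation regime the paper warns about.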

