benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1750978708

   > The work here drastically changes the way my Pull Request should look right now.
   
   Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector), and it could likely copy/paste the deduplicating nearest-neighbor min-heap implementation.
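
   In case it helps when adapting your PR, here is a rough sketch of what I mean by a deduplicating nearest-neighbor min-heap: a bounded top-k structure keyed by parent doc id that keeps only the best-scoring child per parent. The class and method names (`DedupingParentTopK`, `collect`, `minCompetitiveScore`) are illustrative only, not code in this PR, and a real implementation would avoid the boxing and the linear-time `PriorityQueue.remove`:

   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.PriorityQueue;

   /**
    * Illustrative sketch only (not Lucene's actual implementation): a bounded
    * top-k collector that deduplicates child vector hits by parent doc id,
    * keeping only the best score seen for each parent.
    */
   final class DedupingParentTopK {

     record Hit(int parentDocId, float score) {}

     private final int k;
     private final Map<Integer, Float> bestScorePerParent = new HashMap<>();
     // min-heap by score: the weakest retained parent sits on top for eviction
     private final PriorityQueue<Hit> heap =
         new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));

     DedupingParentTopK(int k) {
       this.k = k;
     }

     /** Offer a child hit that has already been mapped to its parent doc id. */
     void collect(int parentDocId, float score) {
       Float previous = bestScorePerParent.get(parentDocId);
       if (previous != null) {
         if (score <= previous) {
           return; // already holding an equal or better child for this parent
         }
         heap.remove(new Hit(parentDocId, previous)); // replace the stale entry
       } else if (heap.size() >= k) {
         Hit weakest = heap.peek();
         if (score <= weakest.score()) {
           return; // not competitive with the current top-k parents
         }
         heap.poll();
         bestScorePerParent.remove(weakest.parentDocId());
       }
       heap.offer(new Hit(parentDocId, score));
       bestScorePerParent.put(parentDocId, score);
     }

     /** Minimum competitive score, handy for early termination of graph search. */
     float minCompetitiveScore() {
       return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek().score();
     }
   }
   ```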
   
   > As a side note, do you happen to have any performance benchmark?
   
   The following test was run over 139004 documents with 768-dimensional float32 vectors.
   
   The statistics for the nested value distributions:
   
   - `1944` total unique documents
   - `62.0` median number of nested values
   - `71.50411522633745` mean number of nested values
   - `309` max number of nested values
   - `1` min number of nested values
   - `2156.9469722481676` variance of nested values
   
   ```
   | 50th percentile latency   | knn-search-10-100          |  3.10031 | ms |
   | 90th percentile latency   | knn-search-10-100          |   3.5629 | ms |
   | 99th percentile latency   | knn-search-10-100          |  4.60912 | ms |
   | 99.9th percentile latency | knn-search-10-100          |   14.322 | ms |
   | 100th percentile latency  | knn-search-10-100          |  72.6463 | ms |
   | 50th percentile latency   | knn-nested-search-10-100   |   6.2615 | ms |
   | 90th percentile latency   | knn-nested-search-10-100   |  6.95849 | ms |
   | 99th percentile latency   | knn-nested-search-10-100   |   7.8881 | ms |
   | 99.9th percentile latency | knn-nested-search-10-100   |  12.0871 | ms |
   | 100th percentile latency  | knn-nested-search-10-100   |  57.9238 | ms |
   | 50th percentile latency   | knn-search-100-1000        |  7.30288 | ms |
   | 90th percentile latency   | knn-search-100-1000        |  8.18694 | ms |
   | 99th percentile latency   | knn-search-100-1000        |  9.23673 | ms |
   | 99.9th percentile latency | knn-search-100-1000        |  18.7072 | ms |
   | 100th percentile latency  | knn-search-100-1000        |  23.8712 | ms |
   | 50th percentile latency   | knn-search-nested-100-1000 |  26.6446 | ms |
   | 90th percentile latency   | knn-search-nested-100-1000 |  38.2561 | ms |
   | 99th percentile latency   | knn-search-nested-100-1000 |  44.3627 | ms |
   | 99.9th percentile latency | knn-search-nested-100-1000 |  51.1843 | ms |
   | 100th percentile latency  | knn-search-nested-100-1000 |  52.0864 | ms |
   ```
   
   GASP! Nested seems 2x to 4x slower!
   
   But, keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors, and consequently doing 3x to 5x more vector comparisons and exploring the graph more deeply. That lines up nicely with the performance difference.
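
   As a rough sanity check against the table above: at the 50th percentile, knn-search-nested-100-1000 vs knn-search-100-1000 is 26.6446 ms / 7.30288 ms ≈ 3.6x, and knn-nested-search-10-100 vs knn-search-10-100 is 6.2615 ms / 3.10031 ms ≈ 2.0x, which tracks the extra visited vectors reasonably well.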
   
   Since HNSW search is roughly `log(n)`, the overall constant overhead of nested search seems rather minor compared to the cost of gathering the nearest vectors.
   
   I am not sure these numbers are reflective of other nested/block-joining 
operations (like a term search).
   
   

