alessandrobenedetti commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1751034549
> > The work here also drastically changes the way my Pull Request should look right now.
>
> Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector), and it could possibly copy/paste the deduplicating nearest-neighbor min-heap implementation.

No worries at all! My work is still paused while I look for sponsors, so no harm done! When I resume it, as you said, I may find benefits in (and make improvements to) the new data structures that were added (I admit I got lost in the number of KnnCollectors and similar classes, but I'm super curious to explore each of them thoroughly).

> > As a side note, do you happen to have any performance benchmark?
>
> The following test was completed over 139004 documents with 768 float32 dimensions.
>
> The statistics for the nested value distributions:
>
> - `1944` total unique documents
> - `62.0` median number of nested values
> - `71.50411522633745` mean number of nested values
> - `309` max number of nested values
> - `1` min number of nested values
> - `2156.9469722481676` variance of nested values
>
> ```
> | 50th percentile latency   | knn-search-10-100          |  3.10031 | ms |
> | 90th percentile latency   | knn-search-10-100          |  3.5629  | ms |
> | 99th percentile latency   | knn-search-10-100          |  4.60912 | ms |
> | 99.9th percentile latency | knn-search-10-100          | 14.322   | ms |
> | 100th percentile latency  | knn-search-10-100          | 72.6463  | ms |
> | 50th percentile latency   | knn-nested-search-10-100   |  6.2615  | ms |
> | 90th percentile latency   | knn-nested-search-10-100   |  6.95849 | ms |
> | 99th percentile latency   | knn-nested-search-10-100   |  7.8881  | ms |
> | 99.9th percentile latency | knn-nested-search-10-100   | 12.0871  | ms |
> | 100th percentile latency  | knn-nested-search-10-100   | 57.9238  | ms |
> | 50th percentile latency   | knn-search-100-1000        |  7.30288 | ms |
> | 90th percentile latency   | knn-search-100-1000        |  8.18694 | ms |
> | 99th percentile latency   | knn-search-100-1000        |  9.23673 | ms |
> | 99.9th percentile latency | knn-search-100-1000        | 18.7072  | ms |
> | 100th percentile latency  | knn-search-100-1000        | 23.8712  | ms |
> | 50th percentile latency   | knn-search-nested-100-1000 | 26.6446  | ms |
> | 90th percentile latency   | knn-search-nested-100-1000 | 38.2561  | ms |
> | 99th percentile latency   | knn-search-nested-100-1000 | 44.3627  | ms |
> | 99.9th percentile latency | knn-search-nested-100-1000 | 51.1843  | ms |
> | 100th percentile latency  | knn-search-nested-100-1000 | 52.0864  | ms |
> ```
>
> GASP! Nested seems 2x to 4x slower!
>
> But keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors, and consequently doing 3-5x more vector comparisons and deeper exploration of the graph. This lines up really nicely with the performance difference.
>
> Since HNSW is `log(n)`, the overall constant overhead of nested seems rather minor compared to the need to gather the nearest vectors.
>
> I am not sure these numbers are reflective of other nested/block-joining operations (like a term search).

Interesting, and thanks for the heads-up. I hope to investigate this further as well in the future!
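For what it's worth, a quick back-of-the-envelope check of the quoted table supports the "lines up nicely" point: at the 50th percentile, 6.2615 / 3.10031 ≈ 2.0x for the 10-100 case and 26.6446 / 7.30288 ≈ 3.6x for the 100-1000 case, i.e. the 2x to 4x slowdown mentioned above. If latency scales roughly with the number of vector comparisons, that tracks the reported 3x to 5x increase in vectors visited reasonably well.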
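And just to make sure I understand the "deduplicating nearest-neighbor min-heap" idea, here is a toy sketch using only JDK classes (`java.util.BitSet`, `PriorityQueue`) of what I read it to mean in the block-join layout: each child (nested vector) hit is mapped to its enclosing parent, and only the best-scoring child per parent competes for the top-k. The class and method names (`DedupTopKParents`, `collect`, `topK`) are made up for illustration, not the PR's actual code, which as I understand it hooks into the new `KnnCollector` machinery.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/**
 * Toy sketch (NOT the PR's implementation): deduplicate child (nested) vector hits
 * to their parent document and keep only the top-k parents in a score min-heap.
 */
final class DedupTopKParents {

  record ParentHit(int parentDoc, float score) {}

  private final int k;
  private final BitSet parentDocs; // a bit is set for every parent docId
  private final Map<Integer, Float> bestPerParent = new HashMap<>();
  private final PriorityQueue<ParentHit> heap =
      new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score())); // min-heap by score

  DedupTopKParents(int k, BitSet parentDocs) {
    this.k = k;
    this.parentDocs = parentDocs;
  }

  /** Collect one child (nested vector) hit found while exploring the graph. */
  void collect(int childDoc, float score) {
    // Block-join layout: children precede their parent, so the enclosing parent
    // is the first parent docId at or after the child.
    int parentDoc = parentDocs.nextSetBit(childDoc);
    Float best = bestPerParent.get(parentDoc);
    if (best != null && best >= score) {
      return; // a sibling of this child already scored at least as well
    }
    bestPerParent.put(parentDoc, score);
    heap.removeIf(hit -> hit.parentDoc() == parentDoc); // replace the parent's old entry, O(k)
    heap.offer(new ParentHit(parentDoc, score));
    if (heap.size() > k) {
      heap.poll(); // evict the weakest parent
    }
  }

  /** Current top-k parents, best score first. */
  List<ParentHit> topK() {
    List<ParentHit> out = new ArrayList<>(heap);
    out.sort((a, b) -> Float.compare(b.score(), a.score()));
    return out;
  }
}
```

Obviously the real thing wouldn't rebuild heap entries with `removeIf`, and the parent bit set would come from Lucene's join machinery rather than `java.util.BitSet`; the sketch is only meant to capture the dedupe-then-top-k shape I'll need to revisit for multi-valued vectors.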