alessandrobenedetti commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1751034549
> > The work here also drastically changes the way my Pull Request should look right now.
>
> Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector), and it could possibly copy/paste the deduplicating nearest-neighbor min-heap implementation.

No worries at all! My work is still paused while I look for sponsors, so no harm done! When I resume it, as you said, I may find benefits in (and make improvements to) the new data structures that were added (I admit I got lost in the number of KnnCollectors and similar classes, but I'm super curious to explore each of them thoroughly).

> > As a side note, do you happen to have any performance benchmark?
>
> The following test was completed over 139004 documents with 768 float32 dimensions.
>
> The statistics for the nested value distributions:
>
> - `1944` total unique documents
> - `62.0` median number of nested values
> - `71.50411522633745` mean number of nested values
> - `309` max number of nested values
> - `1` min number of nested values
> - `2156.9469722481676` variance of nested values
>
> ```
> | 50th percentile latency   | knn-search-10-100          |  3.10031 | ms |
> | 90th percentile latency   | knn-search-10-100          |  3.5629  | ms |
> | 99th percentile latency   | knn-search-10-100          |  4.60912 | ms |
> | 99.9th percentile latency | knn-search-10-100          | 14.322   | ms |
> | 100th percentile latency  | knn-search-10-100          | 72.6463  | ms |
> | 50th percentile latency   | knn-nested-search-10-100   |  6.2615  | ms |
> | 90th percentile latency   | knn-nested-search-10-100   |  6.95849 | ms |
> | 99th percentile latency   | knn-nested-search-10-100   |  7.8881  | ms |
> | 99.9th percentile latency | knn-nested-search-10-100   | 12.0871  | ms |
> | 100th percentile latency  | knn-nested-search-10-100   | 57.9238  | ms |
> | 50th percentile latency   | knn-search-100-1000        |  7.30288 | ms |
> | 90th percentile latency   | knn-search-100-1000        |  8.18694 | ms |
> | 99th percentile latency   | knn-search-100-1000        |  9.23673 | ms |
> | 99.9th percentile latency | knn-search-100-1000        | 18.7072  | ms |
> | 100th percentile latency  | knn-search-100-1000        | 23.8712  | ms |
> | 50th percentile latency   | knn-search-nested-100-1000 | 26.6446  | ms |
> | 90th percentile latency   | knn-search-nested-100-1000 | 38.2561  | ms |
> | 99th percentile latency   | knn-search-nested-100-1000 | 44.3627  | ms |
> | 99.9th percentile latency | knn-search-nested-100-1000 | 51.1843  | ms |
> | 100th percentile latency  | knn-search-nested-100-1000 | 52.0864  | ms |
> ```
>
> GASP! Nested seems 2x to 4x slower!
>
> But keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors, and consequently doing 3-5x more vector comparisons and deeper exploration of the graph. This lines up really nicely with the performance difference.
>
> Since HNSW is `log(n)`, the overall constant overhead of nested seems rather minor compared to the need to gather the nearest vectors.
>
> I am not sure these numbers are reflective of other nested/block-joining operations (like a term search).

Interesting, and thanks for the heads-up. I hope to investigate this further as well in the future!
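For what it's worth, a quick back-of-the-envelope check of the quoted table supports the "lines up nicely" point: at the 50th percentile, 6.2615 / 3.10031 ≈ 2.0x for the 10-100 case and 26.6446 / 7.30288 ≈ 3.6x for the 100-1000 case, i.e. the 2x to 4x slowdown mentioned above. If latency scales roughly with the number of vector comparisons, that tracks the reported 3x to 5x increase in vectors visited reasonably well.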
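And just to make sure I understand the "deduplicating nearest-neighbor min-heap" idea, here is a toy sketch using only JDK classes (`java.util.BitSet`, `PriorityQueue`) of what I read it to mean in the block-join layout: each child (nested vector) hit is mapped to its enclosing parent, and only the best-scoring child per parent competes for the top-k. The class and method names (`DedupTopKParents`, `collect`, `topK`) are made up for illustration, not the PR's actual code, which as I understand it hooks into the new `KnnCollector` machinery.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/**
 * Toy sketch (NOT the PR's implementation): deduplicate child (nested) vector hits
 * to their parent document and keep only the top-k parents in a score min-heap.
 */
final class DedupTopKParents {

  record ParentHit(int parentDoc, float score) {}

  private final int k;
  private final BitSet parentDocs; // a bit is set for every parent docId
  private final Map<Integer, Float> bestPerParent = new HashMap<>();
  private final PriorityQueue<ParentHit> heap =
      new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score())); // min-heap by score

  DedupTopKParents(int k, BitSet parentDocs) {
    this.k = k;
    this.parentDocs = parentDocs;
  }

  /** Collect one child (nested vector) hit found while exploring the graph. */
  void collect(int childDoc, float score) {
    // Block-join layout: children precede their parent, so the enclosing parent
    // is the first parent docId at or after the child.
    int parentDoc = parentDocs.nextSetBit(childDoc);
    Float best = bestPerParent.get(parentDoc);
    if (best != null && best >= score) {
      return; // a sibling of this child already scored at least as well
    }
    bestPerParent.put(parentDoc, score);
    heap.removeIf(hit -> hit.parentDoc() == parentDoc); // replace the parent's old entry, O(k)
    heap.offer(new ParentHit(parentDoc, score));
    if (heap.size() > k) {
      heap.poll(); // evict the weakest parent
    }
  }

  /** Current top-k parents, best score first. */
  List<ParentHit> topK() {
    List<ParentHit> out = new ArrayList<>(heap);
    out.sort((a, b) -> Float.compare(b.score(), a.score()));
    return out;
  }
}
```

Obviously the real thing wouldn't rebuild heap entries with `removeIf`, and the parent bit set would come from Lucene's join machinery rather than `java.util.BitSet`; the sketch is only meant to capture the dedupe-then-top-k shape I'll need to revisit for multi-valued vectors.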