[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355161#comment-17355161 ]

Michael Sokolov commented on LUCENE-9937:
-----------------------------------------

I ran some further tests using hnswlib and KnnGraphTester, trying to look more 
closely at how they differ. The test setup was: index 1M vectors (I used the 
enwiki GloVe-100 vectors from luceneutil) with M=16 and 
efConstruction/beam-width=500, then run 10K queries retrieving the top K=10 for 
each. For the latency comparison table at the end, I set fanout for the Lucene 
KNN to 5, since this yielded the same 0.778 recall as I calculated for hnswlib 
with these settings. First I looked at some deeper per-search internal metrics: 
"hop count" and "update count", as well as the total number of nodes visited (a 
proxy for the number of distance calculations). "Hop count" is how many graph 
arcs we traverse during the search; we compute distances for all the neighbors 
of each visited node, but many of those neighbors get discarded, so we never 
"hop" to them. "Update count" is how many times we found a competitive node and 
updated the priority queue.
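
To make those counters concrete, here is a minimal sketch, in Java, of a 
single-layer beam search instrumented with the three per-query metrics. It is 
illustrative only, not the actual Lucene HnswGraphSearcher or hnswlib code: the 
adjacency-array graph, the two queues, and the dot-product scoring are 
simplified stand-ins.
{code:java}
import java.util.PriorityQueue;

class InstrumentedBeamSearch {
  long visited;  // nodes we computed a distance for
  long hops;     // nodes whose neighbor lists we expanded
  long updates;  // times a competitive node was pushed onto the queues

  int[] search(float[] query, float[][] vectors, int[][] neighbors,
               int entryPoint, int beamWidth, int topK) {
    float[] score = new float[vectors.length];
    boolean[] seen = new boolean[vectors.length];
    // candidates: best-first (max score); results: worst-first so the worst is easy to evict
    PriorityQueue<Integer> candidates =
        new PriorityQueue<>((a, b) -> Float.compare(score[b], score[a]));
    PriorityQueue<Integer> results =
        new PriorityQueue<>((a, b) -> Float.compare(score[a], score[b]));

    score[entryPoint] = dot(query, vectors[entryPoint]);
    seen[entryPoint] = true;
    visited++;
    candidates.add(entryPoint);
    results.add(entryPoint);
    updates++;

    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      // stop once the best remaining candidate cannot improve the beam
      if (results.size() >= beamWidth && score[node] < score[results.peek()]) {
        break;
      }
      hops++;  // we "hop" to this node and look at its neighbors
      for (int nbr : neighbors[node]) {
        if (seen[nbr]) {
          continue;
        }
        seen[nbr] = true;
        score[nbr] = dot(query, vectors[nbr]);
        visited++;  // one distance computation per previously unseen neighbor
        if (results.size() < beamWidth || score[nbr] > score[results.peek()]) {
          candidates.add(nbr);
          results.add(nbr);
          if (results.size() > beamWidth) {
            results.poll();  // evict the worst entry to keep the beam at beamWidth
          }
          updates++;  // found a competitive node and updated the queues
        }
      }
    }
    // keep only the best topK of the beam, best score first
    while (results.size() > topK) {
      results.poll();
    }
    int[] top = new int[results.size()];
    for (int i = top.length - 1; i >= 0; i--) {
      top[i] = results.poll();
    }
    return top;
  }

  private static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
{code}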

 

I hacked hnswlib so it would emit these metrics in a "debug" mode, by printing 
them to stdout, and then calculated some summary statistics (a sketch of that 
summary computation follows the tables). These tables were computed with 
beam-width=200 (not 500) and fanout=0, and I did not compute recall against the 
true nearest neighbors, but used the score distribution as a proxy for it:
h3. hnswlib
{noformat}
     max.score        mean.score        visited           hops         updates
 Min.   :0.6565   Min.   :0.6364   Min.   : 43.0   Min.   :10.0   Min.   : 17.00
 1st Qu.:0.9680   1st Qu.:0.9585   1st Qu.:120.0   1st Qu.:11.0   1st Qu.: 31.00
 Median :0.9871   Median :0.9841   Median :173.0   Median :12.0   Median : 38.00
 Mean   :0.9734   Mean   :0.9667   Mean   :175.7   Mean   :12.7   Mean   : 42.35
 3rd Qu.:0.9954   3rd Qu.:0.9928   3rd Qu.:217.0   3rd Qu.:14.0   3rd Qu.: 47.00
 Max.   :0.9997   Max.   :0.9996   Max.   :463.0   Max.   :33.0   Max.   :195.00
{noformat}
h3. lucene-knn
{noformat}
   top.score        mean.score        visited           hops          updates
 Min.   :0.6995   Min.   :0.6843   Min.   : 82.0   Min.   :12.00   Min.   : 41.0
 1st Qu.:0.9681   1st Qu.:0.9582   1st Qu.:181.0   1st Qu.:18.00   1st Qu.: 83.0
 Median :0.9870   Median :0.9839   Median :205.0   Median :20.00   Median : 98.0
 Mean   :0.9731   Mean   :0.9664   Mean   :210.7   Mean   :20.55   Mean   :100.3
 3rd Qu.:0.9953   3rd Qu.:0.9928   3rd Qu.:237.0   3rd Qu.:23.00   3rd Qu.:113.0
 Max.   :0.9997   Max.   :0.9996   Max.   :391.0   Max.   :39.00   Max.   :232.0
{noformat}
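The tables above are R-style summaries of those per-query debug lines. For 
reference, here is a hedged sketch of computing the same min/quartile/mean/max 
figures in Java; the whitespace-separated column layout is an assumption rather 
than hnswlib's actual debug format, and the quantile rule is nearest-rank rather 
than R's interpolation.
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SummarizeColumn {
  public static void main(String[] args) throws IOException {
    int col = Integer.parseInt(args[0]);  // 0-based column to summarize
    List<Double> values = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.trim().split("\\s+");
        values.add(Double.parseDouble(fields[col]));
      }
    }
    Collections.sort(values);
    double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    System.out.printf("Min %.2f  1stQu %.2f  Median %.2f  Mean %.2f  3rdQu %.2f  Max %.2f%n",
        values.get(0), quantile(values, 0.25), quantile(values, 0.5),
        mean, quantile(values, 0.75), values.get(values.size() - 1));
  }

  // nearest-rank quantile over a sorted list
  static double quantile(List<Double> sorted, double q) {
    int idx = (int) Math.round(q * (sorted.size() - 1));
    return sorted.get(idx);
  }
}
{code}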
I think this shows the two implementations are performing similarly, in the 
sense that they produce similar-scoring results for similar parameter settings 
(see max score and mean score). Lucene KNN visits more nodes, but not that many 
more. It does do quite a few more hops and updates, which I think is accounted 
for by the fact that we don't have hierarchical layers. Also, in the hnswlib 
case I did not include the traversal of the upper layers of the hierarchical 
graph; that would add a few more visits, hops, and updates, although not as 
many as the Lucene search needs, since it starts at arbitrary places in the 
graph and takes a few more hops to converge on the nodes near the target.

After removing the debug code, I ran a latency comparison; in this test:
||impl||recall||latency (ms)||warm count||
|hnswlib|0.778|0.032|n/a|
|lucene|0.778|0.54|cold|
|lucene|0.778|0.11|1000|
|lucene|0.778|0.08|10000|

Clearly warming has a strong impact. In KnnGraphTester, warming is implemented 
by pre-computing matches from the test set, with the idea of paging in the 
index (and the test vectors) as well as giving the HotSpot compiler a chance to 
work its magic. This warming is kind of cheating, since it computes the same 
results we compute in the test, but it does show the difficulty of head-to-head 
latency comparisons of these different systems, and the sensitivity of the 
Java/Lucene implementation to such effects. So I think we need to take the 
ann-benchmarks results with a grain of salt; the test framework there isn't 
really set up for such warming. Maybe we can tweak it to do this somehow?
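
For concreteness, here is a hedged sketch of that warming strategy; it is not 
KnnGraphTester's actual code, and the search callback is a stand-in for 
whichever KNN search entry point is being measured.
{code:java}
import java.util.function.BiFunction;

class WarmedBenchmark {
  /**
   * Run the first warmCount test queries untimed, so that index pages (and the
   * test vectors) are resident and HotSpot has compiled the hot search paths,
   * then time a pass over the full query set.
   */
  static double avgLatencyMs(float[][] queries, int warmCount, int topK,
                             BiFunction<float[], Integer, int[]> search) {
    // warm-up pass: results are discarded, only the side effects matter
    for (int i = 0; i < Math.min(warmCount, queries.length); i++) {
      search.apply(queries[i], topK);
    }
    // timed pass
    long start = System.nanoTime();
    for (float[] query : queries) {
      search.apply(query, topK);
    }
    long elapsedNanos = System.nanoTime() - start;
    return elapsedNanos / 1_000_000.0 / queries.length;
  }
}
{code}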

Another takeaway is that implementing the "H" in HNSW might actually be worth 
it; it clearly saves some work in terms of priority-queue updates and graph 
traversals, and to some extent the number of visited nodes as well.
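
To illustrate what the hierarchy would buy, here is a hedged sketch of the 
greedy descent through the sparse upper layers that a hierarchical 
implementation performs before its layer-0 beam search; the layered adjacency 
structure and dot-product scoring are illustrative assumptions, not Lucene's or 
hnswlib's actual data structures.
{code:java}
class HierarchicalEntryPoint {
  /**
   * Walk down from the top layer to layer 1, greedily moving to any strictly
   * better neighbor on each layer. The node returned is already close to the
   * target, so the layer-0 beam search starts with fewer hops and queue updates
   * than a search that begins at an arbitrary entry point.
   * layers[level][node] is the adjacency list of node on that level; only a
   * small fraction of nodes exist on the upper levels.
   */
  static int findEntryPoint(float[] query, float[][] vectors, int[][][] layers, int topEntry) {
    int ep = topEntry;
    float epScore = dot(query, vectors[ep]);
    for (int level = layers.length - 1; level >= 1; level--) {
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int nbr : layers[level][ep]) {
          float s = dot(query, vectors[nbr]);
          if (s > epScore) {  // strictly better neighbor: greedily move to it
            ep = nbr;
            epScore = s;
            improved = true;
          }
        }
      }
    }
    return ep;  // single, already-close entry point for the layer-0 beam search
  }

  private static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
{code}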

 

> ann-benchmarks results for HNSW search
> --------------------------------------
>
>                 Key: LUCENE-9937
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9937
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the results interesting and opened this issue to share and 
> discuss. My benchmarking code can be found 
> [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
> to commit but I’m happy to help anyone get set up with it.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a 
> brute force scan to determine nearest neighbors
>  * LuceneHnsw: our HNSW implementation
>  * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation 
> from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000      6.764
> LuceneHnsw(n_cands=10)  0.603   7736.968
> LuceneHnsw(n_cands=50)  0.890   3605.000
> LuceneHnsw(n_cands=100) 0.953   2237.429
> LuceneHnsw(n_cands=500) 0.996    570.900
> LuceneHnsw(n_cands=800) 0.998    379.589
> hnswlib(n_cands=10)     0.713  69662.756
> hnswlib(n_cands=50)     0.950  28021.582
> hnswlib(n_cands=100)    0.985  16108.538
> hnswlib(n_cands=500)    1.000   4115.886
> hnswlib(n_cands=800)    1.000   2729.680
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000      6.764
> LuceneHnsw(n_cands=10)  0.507   5036.236
> LuceneHnsw(n_cands=50)  0.760   2099.850
> LuceneHnsw(n_cands=100) 0.833   1233.690
> LuceneHnsw(n_cands=500) 0.941    309.077
> LuceneHnsw(n_cands=800) 0.961    203.782
> hnswlib(n_cands=10)     0.597  43543.345
> hnswlib(n_cands=50)     0.832  14719.165
> hnswlib(n_cands=100)    0.897   8299.948
> hnswlib(n_cands=500)    0.981   1931.985
> hnswlib(n_cands=800)    0.991    881.752
> {code}
> Notes on benchmark:
>  * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and 
> computes the recall against the true neighbors. The recall calculation has a 
> small 'fudge factor' that allows neighbors that are within a small epsilon of 
> the best distance (a sketch of this calculation follows these notes). Queries 
> are executed serially to obtain the QPS.
>  * I chose parameters where hnswlib performed well, then passed these same 
> parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and 
> beamWidth as efConstruction. For search parameters, I set k to k, and fanout 
> as (num_cands - k) so that the beam search is of size num_cands. Note that 
> our default value for beamWidth is 16, which is really low – I wasn’t able to 
> obtain acceptable recall until I bumped it to closer to 500 to match the 
> hnswlib default.
>  * I force-merged to one segment before running searches since this gives the 
> best recall + QPS, and also to match hnswlib.
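> For reference, here is a hedged sketch of the recall calculation described in 
> the first note above; it is modeled on the ann-benchmarks approach rather than 
> copied from it, and the per-query distance arrays being sorted in ascending 
> order is an assumption.
> {code:java}
> class RecallWithFudge {
>   /**
>    * A returned neighbor counts as a hit if its distance is within a small
>    * epsilon of the distance of the k-th true nearest neighbor for that query.
>    * Returns the mean recall over all queries.
>    */
>   static double recall(double[][] resultDists, double[][] trueDists, int k, double epsilon) {
>     double total = 0;
>     for (int q = 0; q < resultDists.length; q++) {
>       double threshold = trueDists[q][k - 1] + epsilon;  // k-th true distance plus the fudge factor
>       int hits = 0;
>       for (int i = 0; i < Math.min(k, resultDists[q].length); i++) {
>         if (resultDists[q][i] <= threshold) {
>           hits++;
>         }
>       }
>       total += hits / (double) k;
>     }
>     return total / resultDists.length;
>   }
> }
> {code}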
> Some observations:
>  * It'd be really nice to extend luceneutil to measure vector search recall 
> in addition to latency. That would help ensure we’re benchmarking a more 
> realistic scenario, instead of accidentally indexing/searching at a very low 
> recall. Tracking recall would also guard against subtle, unintentional 
> changes to the algorithm. It's easy to make an accidental change while 
> refactoring, and with approximate algorithms, unit tests don't guard as well 
> against this.
>  * Lucene HNSW gives a great speed-up over the baseline without sacrificing 
> too much recall. But it doesn't perform as well as hnswlib in terms of both 
> recall and QPS. We wouldn’t expect the results to line up perfectly, since 
> Lucene doesn't actually implement HNSW – the current algorithm isn't 
> hierarchical and only uses a single graph layer. Might this difference 
> indicate we're leaving performance 'on the table' by not using layers, which 
> (I don't think) would add much index time or space? Or are there other 
> algorithmic improvements that would help close the gap?
>  * Setting beamWidth to 500 *really* slowed down indexing. I'll open a 
> separate issue with indexing speed results, keeping this one focused on 
> search.



