[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393282#comment-17393282 ]
Julie Tibshirani commented on LUCENE-9937:
------------------------------------------

I also found time to look at this more deeply. Although it may be annoying, I purposefully stuck with different benchmarking tools, with the idea that some redundancy/diversity can be helpful :)

First, I investigated your point about warm-up. This indeed seems to be critical – both paging in the graph and giving the JIT enough time to compile the hot code. The first run in each of my benchmarks does seem to have been affected by a lack of warm-up. I tried running a couple of "throwaway" runs before beginning the actual benchmarks, and saw a noticeable improvement in the first run. It certainly doesn't close the gap, though, and I wonder why I continue to see a large difference compared to your results:

*sift-128-euclidean, M=16, efConstruction=500*
{code:java}
Approach                    Recall    QPS
LuceneHnsw(n_cands=1000)    0.999     317.838
LuceneHnsw(n_cands=1000)    0.999     314.995
LuceneHnsw(n_cands=1000)    0.999     317.450
LuceneHnsw(n_cands=10)      0.603     9045.574
LuceneHnsw(n_cands=50)      0.890     3789.169
LuceneHnsw(n_cands=100)     0.942     2333.714
LuceneHnsw(n_cands=500)     0.996     579.850
LuceneHnsw(n_cands=800)     0.998     386.105
{code}

Next, I compared the number of visited nodes, which correlates with the number of vector distance computations. To do this, I used FAISS's version of HNSW, which conveniently reports distance computations and exposes them in ann-benchmarks. For Lucene I used 'TopDocs.totalHits', although to be careful I switched it to use 'BitSet#cardinality' instead of 'approximateCardinality'. The recall vs. distance computation curves looked roughly similar. Here's an example run:

*glove-100-angular, M=32, efConstruction=500*
{code:java}
Approach                                            Recall    Distance Comps
LuceneHnsw(n_cands=10)                              0.501     569.819
LuceneHnsw(n_cands=50)                              0.759     1653.680
LuceneHnsw(n_cands=100)                             0.832     2856.152
LuceneHnsw(n_cands=500)                             0.941     11438.336
LuceneHnsw(n_cands=800)                             0.961     17364.790
faiss({'M': 32, 'efConstruction': 500}, ef: 10)     0.479     719.191
faiss({'M': 32, 'efConstruction': 500}, ef: 50)     0.824     2553.125
faiss({'M': 32, 'efConstruction': 500}, ef: 100)    0.900     4648.954
faiss({'M': 32, 'efConstruction': 500}, ef: 500)    0.985     19113.828
faiss({'M': 32, 'efConstruction': 500}, ef: 800)    0.993     28812.516
{code}

This aligns with your observation that there isn't a big difference in the number of 'visited' nodes. It was also surprising to see that FAISS HNSW has worse recall than hnswlib at the same parameters, and is substantially worse in terms of QPS. Perhaps there are some algorithmic improvements in hnswlib that aren't broadly known.

I also attached a profiler to get a sense of the hotspots. This is not new info; it just confirms what you and Robert observed here: [https://github.com/apache/lucene/pull/18].
* For searching, most time is spent comparing vectors ('VectorSimilarityFunction#compare') and decoding vectors ('ByteBufferIndexInput#readFloats').
* For indexing, the time is dominated by the searches performed to add each node. Most of that search time goes towards 'VectorSimilarityFunction#compare', since the graph is held in memory and we don't need to decode vectors.
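To make the warm-up step concrete, here is a minimal Java sketch of the shape of that measurement. It is not the actual harness (my benchmarks run through ann-benchmarks in Python); the 'search' Runnable, the 'WarmupBench' class, and the iteration counts are hypothetical stand-ins for whatever KNN search call is under test:

{code:java}
/** Minimal sketch of a single-threaded QPS measurement with explicit warm-up. */
public final class WarmupBench {

  /**
   * @param search        hypothetical stand-in for the KNN search call being benchmarked
   * @param warmupIters   throwaway runs: page the graph/vectors into the OS cache and
   *                      give the JIT time to compile the hot search loop before timing
   * @param measuredIters runs that actually count towards QPS
   */
  public static double measureQps(Runnable search, int warmupIters, int measuredIters) {
    for (int i = 0; i < warmupIters; i++) {
      search.run(); // results discarded on purpose
    }
    long start = System.nanoTime();
    for (int i = 0; i < measuredIters; i++) {
      search.run();
    }
    double elapsedSeconds = (System.nanoTime() - start) / 1_000_000_000.0;
    return measuredIters / elapsedSeconds;
  }
}
{code}

The throwaway n_cands=1000 rows in the table above play the same role as the warm-up loop here: their numbers are reported but not meant to be compared against the cold first run.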
> ann-benchmarks results for HNSW search
> --------------------------------------
>
>                  Key: LUCENE-9937
>                  URL: https://issues.apache.org/jira/browse/LUCENE-9937
>              Project: Lucene - Core
>           Issue Type: Task
>             Reporter: Julie Tibshirani
>             Priority: Minor
>
> I hooked up our HNSW implementation to [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used repo for benchmarking nearest neighbor search libraries against large datasets. I found the results interesting and opened this issue to share and discuss. My benchmarking code can be found [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy to commit, but I'm happy to help anyone get set up with it.
>
> Approaches:
> * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a brute-force scan to determine nearest neighbors
> * LuceneHnsw: our HNSW implementation
> * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation from the author of the paper
>
> Datasets:
> * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing euclidean distance
> * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing cosine similarity
>
> *Results on sift-128-euclidean*
> Parameters: M=16, efConstruction=500
> {code:java}
> Approach                   Recall    QPS
> LuceneVectorsOnly()        1.000     6.764
> LuceneHnsw(n_cands=10)     0.603     7736.968
> LuceneHnsw(n_cands=50)     0.890     3605.000
> LuceneHnsw(n_cands=100)    0.953     2237.429
> LuceneHnsw(n_cands=500)    0.996     570.900
> LuceneHnsw(n_cands=800)    0.998     379.589
> hnswlib(n_cands=10)        0.713     69662.756
> hnswlib(n_cands=50)        0.950     28021.582
> hnswlib(n_cands=100)       0.985     16108.538
> hnswlib(n_cands=500)       1.000     4115.886
> hnswlib(n_cands=800)       1.000     2729.680
> {code}
>
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=500
> {code:java}
> Approach                   Recall    QPS
> LuceneVectorsOnly()        1.000     6.764
> LuceneHnsw(n_cands=10)     0.507     5036.236
> LuceneHnsw(n_cands=50)     0.760     2099.850
> LuceneHnsw(n_cands=100)    0.833     1233.690
> LuceneHnsw(n_cands=500)    0.941     309.077
> LuceneHnsw(n_cands=800)    0.961     203.782
> hnswlib(n_cands=10)        0.597     43543.345
> hnswlib(n_cands=50)        0.832     14719.165
> hnswlib(n_cands=100)       0.897     8299.948
> hnswlib(n_cands=500)       0.981     1931.985
> hnswlib(n_cands=800)       0.991     881.752
> {code}
>
> Notes on the benchmark:
> * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and computes the recall against the true neighbors. The recall calculation has a small 'fudge factor' that allows neighbors that are within a small epsilon of the best distance. Queries are executed serially to obtain the QPS.
> * I chose parameters where hnswlib performed well, then passed these same parameters to Lucene HNSW. For index-time parameters, I set maxConn to M and beamWidth to efConstruction. For search parameters, I set k to k, and fanout to (num_cands - k) so that the beam search is of size num_cands. Note that our default value for beamWidth is 16, which is really low – I wasn't able to obtain acceptable recall until I bumped it closer to 500 to match the hnswlib default.
> * I force-merged to one segment before running searches, since this gives the best recall + QPS and also matches hnswlib.
>
> Some observations:
> * It'd be really nice to extend luceneutil to measure vector search recall in addition to latency. That would help ensure we're benchmarking a more realistic scenario, instead of accidentally indexing/searching at a very low recall.
> Tracking recall would also guard against subtle, unintentional changes to the algorithm. It's easy to make an accidental change while refactoring, and with approximate algorithms, unit tests don't guard as well against this.
> * Lucene HNSW gives a great speed-up over the baseline without sacrificing too much recall. But it doesn't perform as well as hnswlib in terms of both recall and QPS. We wouldn't expect the results to line up perfectly, since Lucene doesn't actually implement HNSW – the current algorithm isn't hierarchical and only uses a single graph layer. Might this difference indicate we're leaving performance 'on the table' by not using layers, which (I don't think) would add much indexing time or space? Or are there other algorithmic improvements that would help close the gap?
> * Setting beamWidth to 500 *really* slowed down indexing. I'll open a separate issue with indexing speed results, keeping this one focused on search.
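For anyone reproducing the setup quoted above, here is a minimal sketch of the parameter translation it describes. The 'HnswParams' holder and 'fromAnnBenchmarks' method are hypothetical names introduced for illustration; only the mapping itself (maxConn = M, beamWidth = efConstruction, fanout = num_cands - k) comes from the description:

{code:java}
/** Hypothetical holder for the ann-benchmarks -> Lucene parameter mapping described above. */
public record HnswParams(int maxConn, int beamWidth, int k, int fanout) {

  /** Translates ann-benchmarks parameters into the Lucene HNSW knobs named in the issue. */
  public static HnswParams fromAnnBenchmarks(int m, int efConstruction, int k, int numCands) {
    // Index time: maxConn <- M, beamWidth <- efConstruction.
    // Search time: fanout <- num_cands - k, so the beam search considers num_cands candidates.
    return new HnswParams(m, efConstruction, k, numCands - k);
  }
}
{code}

For example, n_cands=100 with k=10 maps to fanout=90, giving a beam of 100 candidates at search time.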