[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569420#comment-17569420 ]
Michael Sokolov edited comment on LUCENE-10404 at 7/21/22 1:39 PM: ------------------------------------------------------------------- I tried using IntIntHashMap (mapping to 1 for visited nodes) and indeed does seem to be a small speedup. I haven't had a chance to run luceneutil nor look at profiler output, but here are some numbers from KnnGraphTester for an internal dataset. The numbers can be a bit noisy, but are consistently better for the hash map version. h3. IntIntHashMap {{{{recall latency nDoc fanout maxConn beamWidth visited index ms}}}} {{{{0.935 0.37 10000 0 16 32 100 1566}}}} {{{{0.965 0.49 10000 50 16 32 150 0}}}} {{{{0.962 0.41 10000 0 16 64 100 2655}}}} {{{{0.982 0.57 10000 50 16 64 150 0}}}} {{{{0.941 0.38 10000 0 32 32 100 1473}}}} {{{{0.969 0.51 10000 50 32 32 150 0}}}} {{{{0.966 0.45 10000 0 32 64 100 2611}}}} {{{{0.985 0.59 10000 50 32 64 150 0}}}} {{{{0.907 0.52 100000 0 16 32 100 19850}}}} {{{{0.940 0.72 100000 50 16 32 150 0}}}} {{{{0.941 0.60 100000 0 16 64 100 38614}}}} {{{{0.966 0.84 100000 50 16 64 150 0}}}} {{{{0.916 0.55 100000 0 32 32 100 19243}}}} {{{{0.949 0.75 100000 50 32 32 150 0}}}} {{{{0.952 0.66 100000 0 32 64 100 38205}}}} {{{{0.973 0.93 100000 50 32 64 150 0}}}} {{{{0.859 0.66 1000000 0 16 32 100 273112}}}} {{{{0.897 0.92 1000000 50 16 32 150 0}}}} {{0.917 0.85 1000000 0 16 64 100 523325}} {{0.946 1.06 1000000 50 16 64 150 0}} more to come – pushed ctrl-enter instead of enter ... h3. baseline {{recall latency nDoc fanout maxConn beamWidth visited index ms}} {{0.935 0.38 10000 0 16 32 100 1614}} {{0.965 0.50 10000 50 16 32 150 0}} {{0.962 0.45 10000 0 16 64 100 2687}} {{0.982 0.57 10000 50 16 64 150 0}} {{0.941 0.40 10000 0 32 32 100 1504}} {{0.969 0.51 10000 50 32 32 150 0}} {{0.966 0.44 10000 0 32 64 100 2652}} {{0.985 0.58 10000 50 32 64 150 0}} {{0.907 0.54 100000 0 16 32 100 21449}} {{0.940 0.74 100000 50 16 32 150 0}} {{0.941 0.64 100000 0 16 64 100 39962}} {{0.966 0.88 100000 50 16 64 150 0}} {{0.916 0.59 100000 0 32 32 100 20554}} {{0.949 0.80 100000 50 32 32 150 0}} {{0.952 0.72 100000 0 32 64 100 40980}} {{0.973 1.04 100000 50 32 64 150 0}} {{0.859 0.75 1000000 0 16 32 100 300514}} {{0.897 0.96 1000000 50 16 32 150 0}} {{0.917 0.84 1000000 0 16 64 100 563259}} {{0.946 1.12 1000000 50 16 64 150 0}} {{0.874 0.86 1000000 0 32 32 100 303186}} {{0.913 1.09 1000000 50 32 32 150 0}} {{0.929 1.04 1000000 0 32 64 100 580725}} {{0.958 1.38 1000000 50 32 64 150 0}} was (Author: sokolov): I tried using IntIntHashMap (mapping to 1 for visited nodes) and indeed does seem to be a small speedup. I haven't had a chance to run luceneutil nor look at profiler output, but here are some numbers from KnnGraphTester for an internal dataset. The numbers can be a bit noisy, but are consistently better for the hash map version. h3. IntIntHashMap {{recall latency nDoc fanout maxConn beamWidth visited index ms}} {{0.935 0.37 10000 0 16 32 100 1566}} {{0.965 0.49 10000 50 16 32 150 0}} {{0.962 0.41 10000 0 16 64 100 2655}} {{0.982 0.57 10000 50 16 64 150 0}} {{0.941 0.38 10000 0 32 32 100 1473}} {{0.969 0.51 10000 50 32 32 150 0}} {{0.966 0.45 10000 0 32 64 100 2611}} {{0.985 0.59 10000 50 32 64 150 0}} {{0.907 0.52 100000 0 16 32 100 19850}} {{0.940 0.72 100000 50 16 32 150 0}} {{0.941 0.60 100000 0 16 64 100 38614}} {{0.966 0.84 100000 50 16 64 150 0}} {{0.916 0.55 100000 0 32 32 100 19243}} {{0.949 0.75 100000 50 32 32 150 0}} {{0.952 0.66 100000 0 32 64 100 38205}} {{0.973 0.93 100000 50 32 64 150 0}} {{0.859 0.66 1000000 0 16 32 100 273112}} {{{}0.897 0.92 1000000 50 16 32 150 0{}}}{{{}0.917 0.85 1000000 0 16 64 100 523325 0.946 1.06 1000000 50 16 64 150 0 {}}} h3. baseline {{recall latency nDoc fanout maxConn beamWidth visited index ms}} {{0.935 0.38 10000 0 16 32 100 1614}} {{0.965 0.50 10000 50 16 32 150 0}} {{0.962 0.45 10000 0 16 64 100 2687}} {{0.982 0.57 10000 50 16 64 150 0}} {{0.941 0.40 10000 0 32 32 100 1504}} {{0.969 0.51 10000 50 32 32 150 0}} {{0.966 0.44 10000 0 32 64 100 2652}} {{0.985 0.58 10000 50 32 64 150 0}} {{0.907 0.54 100000 0 16 32 100 21449}} {{0.940 0.74 100000 50 16 32 150 0}} {{0.941 0.64 100000 0 16 64 100 39962}} {{0.966 0.88 100000 50 16 64 150 0}} {{0.916 0.59 100000 0 32 32 100 20554}} {{0.949 0.80 100000 50 32 32 150 0}} {{0.952 0.72 100000 0 32 64 100 40980}} {{0.973 1.04 100000 50 32 64 150 0}} {{0.859 0.75 1000000 0 16 32 100 300514}} {{0.897 0.96 1000000 50 16 32 150 0}} {{0.917 0.84 1000000 0 16 64 100 563259}} {{0.946 1.12 1000000 50 16 64 150 0}} {{0.874 0.86 1000000 0 32 32 100 303186}} {{0.913 1.09 1000000 50 32 32 150 0}} {{0.929 1.04 1000000 0 32 64 100 580725}} {{0.958 1.38 1000000 50 32 64 150 0}} > Use hash set for visited nodes in HNSW search? > ---------------------------------------------- > > Key: LUCENE-10404 > URL: https://issues.apache.org/jira/browse/LUCENE-10404 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Julie Tibshirani > Priority: Minor > > While searching each layer, HNSW tracks the nodes it has already visited > using a BitSet. We could look into using something like IntHashSet instead. I > tried out the idea quickly by switching to IntIntHashMap (which has already > been copied from hppc) and saw an improvement in index performance. > *Baseline:* 760896 msec to write vectors > *Using IntIntHashMap:* 733017 msec to write vectors > I noticed search performance actually got a little bit worse with the change > -- that is something to look into. > For background, it's good to be aware that HNSW can visit a lot of nodes. For > example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search > visits ~1000 - 15,000 docs depending on the recall. This number can increase > when searching with deleted docs, especially if you hit a "pathological" case > where the deleted docs happen to be closest to the query vector. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org