[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571594#comment-17571594 ]
Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 8:01 PM: ------------------------------------------------------------------- Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings, and we can see that here the IntIntHashMap is adding cost h3. baseline h3. {{recall latency nDoc fanout maxConn beamWidth visited index ms}} {{0.991 0.92 10000 50 64 500 150 12068}} {{0.996 1.11 10000 100 64 500 200 0}} {{0.999 1.45 10000 200 64 500 300 0}} {{1.000 1.94 10000 400 64 500 500 0}} {{0.955 2.53 100000 50 64 500 150 463142}} {{0.973 3.03 100000 100 64 500 200 0}} {{0.988 4.44 100000 200 64 500 300 0}} {{0.997 6.57 100000 400 64 500 500 0}} {{0.895 3.44 1000000 50 64 500 150 9811483}} {{0.920 4.33 1000000 100 64 500 200 0}} {{0.950 6.20 1000000 200 64 500 300 0}} {{0.974 9.53 1000000 400 64 500 500 0}} IntIntHashMap {{recall latency nDoc fanout maxConn beamWidth visited index ms}} {{0.991 1.03 10000 50 64 500 150 13274}} {{0.996 1.24 10000 100 64 500 200 0}} {{0.999 1.62 10000 200 64 500 300 0}} {{1.000 2.09 10000 400 64 500 500 0}} {{0.955 2.47 100000 50 64 500 150 485131}} {{0.973 3.31 100000 100 64 500 200 0}} {{0.988 4.66 100000 200 64 500 300 0}} {{0.997 7.26 100000 400 64 500 500 0}} {{0.895 3.58 1000000 50 64 500 150 10173818}} {{0.920 4.49 1000000 100 64 500 200 0}} {{0.950 6.45 1000000 200 64 500 300 0}} {{0.974 9.91 1000000 400 64 500 500 0}} was (Author: sokolov): Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings, and we can see that here the IntIntHashMap is adding cost h3. baseline {{recall latency nDoc fanout maxConn beamWidth visited index ms 0.991 0.92 10000 50 64 500 150 12068 0.996 1.11 10000 100 64 500 200 0 0.999 1.45 10000 200 64 500 300 0 1.000 1.94 10000 400 64 500 500 0 0.955 2.53 100000 50 64 500 150 463142 0.973 3.03 100000 100 64 500 200 0 0.988 4.44 100000 200 64 500 300 0 0.997 6.57 100000 400 64 500 500 0 0.895 3.44 1000000 50 64 500 150 9811483 0.920 4.33 1000000 100 64 500 200 0 0.950 6.20 1000000 200 64 500 300 0 0.974 9.53 1000000 400 64 500 500 0}} }} h3. IntIntHashMap {{recall latency nDoc fanout maxConn beamWidth visited index ms 0.991 1.03 10000 50 64 500 150 13274 0.996 1.24 10000 100 64 500 200 0 0.999 1.62 10000 200 64 500 300 0 1.000 2.09 10000 400 64 500 500 0 0.955 2.47 100000 50 64 500 150 485131 0.973 3.31 100000 100 64 500 200 0 0.988 4.66 100000 200 64 500 300 0 0.997 7.26 100000 400 64 500 500 0 0.895 3.58 1000000 50 64 500 150 10173818 0.920 4.49 1000000 100 64 500 200 0 0.950 6.45 1000000 200 64 500 300 0 0.974 9.91 1000000 400 64 500 500 0 }} > Use hash set for visited nodes in HNSW search? > ---------------------------------------------- > > Key: LUCENE-10404 > URL: https://issues.apache.org/jira/browse/LUCENE-10404 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Julie Tibshirani > Priority: Minor > > While searching each layer, HNSW tracks the nodes it has already visited > using a BitSet. We could look into using something like IntHashSet instead. I > tried out the idea quickly by switching to IntIntHashMap (which has already > been copied from hppc) and saw an improvement in index performance. > *Baseline:* 760896 msec to write vectors > *Using IntIntHashMap:* 733017 msec to write vectors > I noticed search performance actually got a little bit worse with the change > -- that is something to look into. > For background, it's good to be aware that HNSW can visit a lot of nodes. For > example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search > visits ~1000 - 15,000 docs depending on the recall. This number can increase > when searching with deleted docs, especially if you hit a "pathological" case > where the deleted docs happen to be closest to the query vector. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org