[
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571594#comment-17571594
]
Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 8:01 PM:
-------------------------------------------------------------------
Here is a test using GloVe 100-dim vectors with much more aggressive indexing
settings. Here the IntIntHashMap adds cost: search latency is higher in all but
one row, and index time is higher at every index size.
h3. baseline
{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.991 0.92 10000 50 64 500 150 12068}}
{{0.996 1.11 10000 100 64 500 200 0}}
{{0.999 1.45 10000 200 64 500 300 0}}
{{1.000 1.94 10000 400 64 500 500 0}}
{{0.955 2.53 100000 50 64 500 150 463142}}
{{0.973 3.03 100000 100 64 500 200 0}}
{{0.988 4.44 100000 200 64 500 300 0}}
{{0.997 6.57 100000 400 64 500 500 0}}
{{0.895 3.44 1000000 50 64 500 150 9811483}}
{{0.920 4.33 1000000 100 64 500 200 0}}
{{0.950 6.20 1000000 200 64 500 300 0}}
{{0.974 9.53 1000000 400 64 500 500 0}}
h3. IntIntHashMap
{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.991 1.03 10000 50 64 500 150 13274}}
{{0.996 1.24 10000 100 64 500 200 0}}
{{0.999 1.62 10000 200 64 500 300 0}}
{{1.000 2.09 10000 400 64 500 500 0}}
{{0.955 2.47 100000 50 64 500 150 485131}}
{{0.973 3.31 100000 100 64 500 200 0}}
{{0.988 4.66 100000 200 64 500 300 0}}
{{0.997 7.26 100000 400 64 500 500 0}}
{{0.895 3.58 1000000 50 64 500 150 10173818}}
{{0.920 4.49 1000000 100 64 500 200 0}}
{{0.950 6.45 1000000 200 64 500 300 0}}
{{0.974 9.91 1000000 400 64 500 500 0}}
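To put a number on the gap, here is a minimal sketch that computes the relative latency change row by row. The latency values are copied from the two tables above; the class and method names ({{LatencyOverhead}}, {{percentChange}}) are made up for illustration and are not part of Lucene or luceneutil.

```java
import java.util.Locale;

final class LatencyOverhead {
    // Latency columns copied from the baseline and IntIntHashMap tables, in row order.
    static final double[] BASELINE = {0.92, 1.11, 1.45, 1.94, 2.53, 3.03, 4.44, 6.57, 3.44, 4.33, 6.20, 9.53};
    static final double[] HASH_MAP = {1.03, 1.24, 1.62, 2.09, 2.47, 3.31, 4.66, 7.26, 3.58, 4.49, 6.45, 9.91};

    // Relative change of candidate versus baseline, in percent (positive means slower).
    static double percentChange(double baseline, double candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        for (int i = 0; i < BASELINE.length; i++) {
            System.out.printf(Locale.ROOT, "%.2f -> %.2f  %+.1f%%%n",
                BASELINE[i], HASH_MAP[i], percentChange(BASELINE[i], HASH_MAP[i]));
        }
    }
}
```

A negative value marks a row where the IntIntHashMap run was faster than baseline. Single-run latencies are noisy, so small differences should be read with care.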
> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited
> using a BitSet. We could look into using something like IntHashSet instead. I
> tried out the idea quickly by switching to IntIntHashMap (which has already
> been copied from hppc) and saw an improvement in index performance.
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search
> visits ~1,000 to 15,000 docs depending on the recall. This number can increase
> when searching with deleted docs, especially if you hit a "pathological" case
> where the deleted docs happen to be closest to the query vector.
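
For readers following the thread, the two visited-node strategies under discussion can be sketched roughly as below. This is an illustration only: it uses {{java.util.BitSet}} and {{java.util.HashSet}} as stand-ins for Lucene's bit set and the hppc-derived hash structures, and the {{Visited}} interface and class names are hypothetical, not Lucene's actual search code.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

final class VisitedSketch {
    // Hypothetical abstraction: returns true only the first time a node is seen.
    interface Visited {
        boolean markVisited(int node);
    }

    // Bit set variant: memory proportional to the graph size, constant-time bit test.
    static final class BitSetVisited implements Visited {
        private final BitSet bits;

        BitSetVisited(int numNodes) {
            bits = new BitSet(numNodes);
        }

        @Override
        public boolean markVisited(int node) {
            if (bits.get(node)) {
                return false;
            }
            bits.set(node);
            return true;
        }
    }

    // Hash set variant: memory proportional to the nodes actually visited,
    // at the price of hashing (and, with this stand-in, boxing) on every check.
    static final class HashVisited implements Visited {
        private final Set<Integer> seen = new HashSet<>();

        @Override
        public boolean markVisited(int node) {
            return seen.add(node);
        }
    }

    // Counts how many candidate nodes are visited for the first time.
    static int countFirstVisits(Visited visited, int[] candidates) {
        int count = 0;
        for (int node : candidates) {
            if (visited.markVisited(node)) {
                count++;
            }
        }
        return count;
    }
}
```

Both variants give the same answer for the same candidate stream; they differ only in memory footprint and per-check cost, which is the trade-off the benchmarks in this issue are probing.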