[ https://issues.apache.org/jira/browse/LUCENE-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264190#comment-17264190 ]
Michael McCandless commented on LUCENE-9644: -------------------------------------------- This change seems to have caused a surprising jump in Lucene's nightly {{VectorSearch}} performance: [https://home.apache.org/~mikemccand/lucenebench/VectorSearch.html] (annotation DJ). > HNSW diverse neighbor selection heuristic > ----------------------------------------- > > Key: LUCENE-9644 > URL: https://issues.apache.org/jira/browse/LUCENE-9644 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael Sokolov > Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > This will replace the simple nearest neighbor selection with a criterion that > takes into account the distance of the neighbors from each other. It is seen > to provide dramatically improved recall on at least two datasets, and is what > is being used by our reference implementation, hnswlib. The basic idea is > that when selecting the nearest neighbors to associate with a new node added > to the graph, we filter using a diversity criterion. If a candidate neighbor > is closer to an already-added (closer to the new node) neighbor than it is to > the new node, then we pass over it, moving on to more-distant, but presumably > more diverse neighbors. The same criterion is also (re-) applied to the > neighbor nodes' neighbors, since we add the links bidirectionally. > h2. Results: > h3. GloVe/Wikipedia > h4. baseline > ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index > ms|| > |0.643| 0.77| 100000| 50| 32| 64| 1742| 22830| > |0.671| 0.95| 100000| 100| 32| 64| 2141| 0| > |0.704| 1.32| 100000| 200| 32| 64| 2923| 0| > |0.739| 2.04| 100000| 400| 32| 64| 4382| 0| > |0.470| 0.91| 1000000| 50| 32| 64| 2068| 337081| > |0.496| 1.21| 1000000| 100| 32| 64| 2548| 0| > |0.533| 1.77| 1000000| 200| 32| 64| 3479| 0| > |0.573| 2.58| 1000000| 400| 32| 64| 5257| 0| > h4. diverse > ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index > ms|| > |0.801| 0.57| 100000| 50| 32| 64| 593| 17985| > |0.840| 0.67| 100000| 100| 32| 64| 738| 0| > |0.883| 0.97| 100000| 200| 32| 64| 1018| 0| > |0.921| 1.36| 100000| 400| 32| 64| 1502| 0| > |0.723| 0.71| 1000000| 50| 32| 64| 860| 298383| > |0.761| 0.77| 1000000| 100| 32| 64| 1058| 0| > |0.806| 1.06| 1000000| 200| 32| 64| 1442| 0| > |0.854| 1.67| 1000000| 400| 32| 64| 2159| 0| > h3. Dataset from work: > h4. baseline > ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index > ms|| > |0.933| 1.41| 100000| 50| 32| 64| 1496| 35462| > |0.948| 1.39| 100000| 100| 32| 64| 1872| 0| > |0.961| 2.10| 100000| 200| 32| 64| 2591| 0| > |0.972| 3.04| 100000| 400| 32| 64| 3939| 0| > |0.827| 1.34| 1000000| 50| 32| 64| 1676| 535802| > |0.854| 1.76| 1000000| 100| 32| 64| 2056| 0| > |0.887| 2.47| 1000000| 200| 32| 64| 2761| 0| > |0.907| 3.75| 1000000| 400| 32| 64| 4129| 0| > h4. diverse > ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index > ms|| > |0.966| 1.18| 100000| 50| 32| 64| 1480| 37656| > |0.977| 1.46| 100000| 100| 32| 64| 1832| 0| > |0.988| 2.00| 100000| 200| 32| 64| 2472| 0| > |0.995| 3.14| 100000| 400| 32| 64| 3629| 0| > |0.944| 1.34| 1000000| 50| 32| 64| 1780| 526834| > |0.959| 1.71| 1000000| 100| 32| 64| 2222| 0| > |0.975| 2.30| 1000000| 200| 32| 64| 3041| 0| > |0.986| 3.56| 1000000| 400| 32| 64| 4543| 0| -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org