mikemccand commented on issue #15509:
URL: https://github.com/apache/lucene/issues/15509#issuecomment-3666422833
Oooh +1! `CheckIndex` is already walking the HNSW graph to check
integrity or so (no duplicated transitions at least)? So maybe getting these
stats and printing them when the user asks for `-verbose` is simple? Those stats
can be helpful. For example you could compare two graphs that are supposed to
be similar (say indexed from same set of documents, but maybe in different
order or so) and gauge the aggregate statistics (histograms showing how bushy
the nodes generally are).
We tell HNSW construction it can add up to `maxConn` edges to each node, but
it often/typically uses fewer. Here's a recent output from one of my
`knnPerfTest.py` runs:
```
Graph level=2 size=51, Fanout min=4, mean=9.33, max=14, meandelta=25557.37
% 0 10 20 30 40 50 60 70 80 90 100
0 5 7 7 9 9 10 10 12 13 14
Graph level=1 size=5870, Fanout min=11, mean=40.85, max=64, meandelta=9241.01
% 0 10 20 30 40 50 60 70 80 90 100
0 24 28 31 35 39 43 49 56 64 64
Graph level=0 size=400000, Fanout min=1, mean=63.43, max=128, meandelta=6004.07
% 0 10 20 30 40 50 60 70 80 90 100
0 31 38 44 49 56 64 74 89 115 128
Graph level=2 size=51, connectedness=1.00
Graph level=1 size=5870, connectedness=1.00
Graph level=0 size=400000, connectedness=1.00
```
So P50 at level=0 (all vectors) is 56 connected nodes.
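For reference, here is a minimal sketch of how per-level fanout stats like these could be computed. It is not the code `knnPerfTest.py` or `CheckIndex` actually runs: the `fanout_stats` helper, the toy adjacency lists, and the no-interpolation percentile rule are all assumptions made for illustration.

```python
from statistics import mean


def fanout_stats(neighbors_per_node):
    """Summarize how many neighbors each node has on one graph level.

    `neighbors_per_node` is a hypothetical input: one list of neighbor
    ids per node on that level.
    """
    fanouts = sorted(len(neighbors) for neighbors in neighbors_per_node)
    if not fanouts:
        raise ValueError("level has no nodes")
    last = len(fanouts) - 1
    # Simple percentile rule (an assumption): index into the sorted
    # fanouts without interpolating between adjacent values.
    percentiles = {p: fanouts[(p * last) // 100] for p in range(0, 101, 10)}
    return {
        "size": len(fanouts),
        "min": fanouts[0],
        "mean": mean(fanouts),
        "max": fanouts[-1],
        "percentiles": percentiles,
    }


# Toy level with five nodes; node i's neighbors are toy_level[i].
toy_level = [[1, 2], [0], [0, 3, 4], [2], [2]]
stats = fanout_stats(toy_level)
print(stats)
```

Running the same summary over two graphs built from the same documents in a different order would give the kind of side-by-side histogram comparison described above.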
Hmm, why is P100 128? I had run this with `maxConn=64`. Are we somehow
doubling this somewhere? Maybe `knnPerfTest` is doing something fishy?
See! This is why such transparency is so helpful :)
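One possible explanation for the 128, offered as an assumption and not verified against `knnPerfTest` here: the original HNSW design allows the bottom layer a larger connection cap than the upper layers, commonly twice `maxConn`. A rough sketch of a sanity check under that assumption, using only the per-level max fanouts from the output above (`fanout_cap` is a hypothetical helper, not a Lucene API):

```python
def fanout_cap(level, max_conn):
    # ASSUMPTION: level 0 may hold up to 2 * max_conn neighbors per node,
    # while every higher level is capped at max_conn.
    return 2 * max_conn if level == 0 else max_conn


max_conn = 64
# Max fanout per level, copied from the stats output above.
observed_max = {2: 14, 1: 64, 0: 128}

# Collect any level whose observed max exceeds the assumed cap.
violations = {
    level: fanout
    for level, fanout in observed_max.items()
    if fanout > fanout_cap(level, max_conn)
}
print(violations)
```

If that assumption holds, no level exceeds its cap and the 128 at level 0 is expected behavior instead of a bug in the test harness.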
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]