nitirajrathore opened a new issue, #12627:
URL: https://github.com/apache/lucene/issues/12627
### Description
I work for Amazon Retail Product search and we are using Lucene KNN for
semantic search of products.
We recently noticed that the hnsw graphs generated are not always strongly
connected and in worst case scenario some products may be undiscoverable.
Connectedness of Hierarchical graph can be complicated, so below I am
mentioning my experiment details.
### Experiment:
I took the Lucene indexes from our production servers and for each segment
(hnsw graph) I did following test.
At each level graph I took the same entry point, the entry point of HNSW
graph, checked how many nodes are reachable from this entrypoint. Note that
connectedness at each level was checked independently of other levels.
Sample code attached. My observations are as below.
- Observation :
1. Of all the graphs across all the segments, across 100s of indexes that I
considered, one graph for each "level" of HNSW, almost 18% of the graphs
had some disconnectedness.
2. Disconnectedness was observed at all the levels of HNSW graphs. I saw we
have
at most 3 to 4 levels in HNSW graphs.
3. percentage disconnectedness ranged from small fractions 0.000386% (1
disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
In some extreme case the entry-point(EP) in zeroth level graph was
disconnected
from rest of the graph making the %age disconnected as high as 99.9% (only 65
reachable nodes from EP out of 252275). But this does not necessarily mean
that the 99.9% of nodes were not discoverable, it just means that if
unluckily we end up on EP in the 0th level graph for a query, there can at
max be 65 nodes that can be reached. But had we diverted our path from EP
to some other node in the upper level graphs then may be more nodes be
discoverable via that node.
### Reproduciability
Although it is fairly easy to be reproduced with Amazon's internal data,
but I found it really hard to reproduce it with random vectors. I will attached
a simple main class of my test and the parameters that I tried. I only got some
disconnectedness as listed in below table. For higher number of vectors, the
execution becomes painfully slow because of hnsw graph creation. I will
continue to work to find some parameters for reproducibilty and also explore
*open source datasets* for the same.
``DIS-L0 means: %age of disconnected nodes in graph at level-0 of HNSW
graph.``
| randomness seed | dimension | no of doc vectors | Max-conn |
Beam-width | dis-l0 | dis-l1 | dis-l2 | dis-l3 | dis-l4 |
| ---------- | --- | --------- | --- | --------- | ------- |
---------------------------- | ------- | ------- | ------- |
| -219343918 | 256 | 100_000 | 16 | 100 | 0 | 0.0158 (1 node
disconnected) | 0 | 0 | 0 |
| -245556271 | 256 | 100_000 | 16 | 100 | 0 | 0.0158 (1 node
disconnected) | 0 | 0 | 0 |
| -766887769 | 256 | 1_000_000 | 16 | 100 | 0.0003 | 0.04166
| 0 | 0 | 0 |
No direct solution comes to my mind but as per my discussion in this email
thread. I will look into the part where some connections are removed to
maintain the Max-Conn property (diversity check etc.).
### Version and environment details
Lucene version : 9.7
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]