nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1814046690
> What if we added an "incoming connection" count for every node? & > I think this idea would prevent the isolated nodes, but not fix the other case. I will write a small code to check the type of disconnectedness. Clustered vs single > but then we (I think?) decided that wasn't the approach advocated by the academic papers we were emulating, and implemented the diversity check to every added neighbor Paper proposes 3 different heuristics. 1. check diversity before adding any node. -- `Standard` in paper. ([getNeighborsByHeuristic2()](https://github.com/nmslib/nmslib/blob/master/similarity_search/include/method/hnsw.h#L130) in the code) 2. check diversity before adding any node, but if the nodes added are lesser than the max-conn then add atleast max-conn nodes. `keepPrunedConnections` in paper ([getNeighborsByHeuristic1()](https://github.com/nmslib/nmslib/blob/master/similarity_search/include/method/hnsw.h#L83C34-L83C34) in the code) 3. add all the neighbours of neighbours in the candidate list before finding the diverse nodes as in (1) and optionally apply `keepPrunedConnections` as well. Called `extendCandidates` in paper. ( Not sure but, I guess this is the [getNeighborsByHeuristic3()](https://github.com/nmslib/nmslib/blob/master/similarity_search/include/method/hnsw.h#L171) in the code. but need to understand it a bit.) Our current implementation is like (1) with some optimizations for performance. My proposed approach is similar to (2) with more strict conditions by checking diversity only for nodes which are surely connected via their neighbours. I earlier tried to implement (2) just by keeping the pruned connections till max-conn. As I mentioned, this increased the disconnected nodes by a factor instead of decreasing it. The reason I understood was that now there was more competition between nodes as every node now had max-conn number of connections all the time, triggering the `findWorstNonDiverse()` to be called every time for every neighbour, and most of the time the just now newly formed edge will be non-diverse and will get dropped. In this I think there might be something wrong in the implementation of `checked` and `unchecked` checking that we are doing in `findWorstNonDiverse()`. I am not able to get my head around that checked-unchecked code completely, but I think it give green signal to all the current neighbours as diverse incorrectly and checking just the 'unchecked' connections triggers the diversity check only for the newly added edge which it founds to be non-diverse and removes it. I think there is huge difference between the number of connections between heuristic (1) and (2), and by inference between the current implementation and my proposed implementation, so I will post numbers around that. But I don't think that is necessarily bad, if the number of avg connections are increasing it is also increasing the recall with ExactKNN, meaning that we can get same recall for a smaller max-conn value now. I will also try to get some number around same recall with current implementation with lower max-conn and then check the search latency. This may also reduce the indexing time, lets see by how much. > @nitirajrathore could you add something to [KnnGraphTester](https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java) that is a test for connectedness? I have separate scripts for those, I will post the links here and will try to get the code in the luceneutil, although compilation is a issue in that package ( which we are trying to fix separately ). We can think of adding that as optional check in the KnnGraphTester as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org