nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1814046690

   > What if we added an "incoming connection" count for every node?
   
   &
   
   >  I think this idea would prevent the isolated nodes, but not fix the other 
case.
   
   I will write a small code to check the type of disconnectedness. Clustered 
vs single
   
   > but then we (I think?) decided that wasn't the approach advocated by the 
academic papers we were emulating, and implemented the diversity check to every 
added neighbor
   
   Paper proposes 3 different heuristics. 
   1. check diversity before adding any node. -- `Standard` in paper.  
([getNeighborsByHeuristic2()](https://github.com/nmslib/nmslib/blob/master/similarity_search/include/method/hnsw.h#L130)
 in the code)
   2. check diversity before adding any node, but if the nodes added are lesser 
than the max-conn then add atleast max-conn nodes. `keepPrunedConnections` in 
paper 
([getNeighborsByHeuristic1()](https://github.com/nmslib/nmslib/blob/master/similarity_search/include/method/hnsw.h#L83C34-L83C34)
 in the code)
   3. add all the neighbours of neighbours in the candidate list before finding 
the diverse nodes as in (1) and optionally apply `keepPrunedConnections` as 
well. Called `extendCandidates` in paper. ( Not sure but, I guess this is the 
[getNeighborsByHeuristic3()](https://github.com/nmslib/nmslib/blob/master/similarity_search/include/method/hnsw.h#L171)
 in the code. but need to understand it a bit.)
   
   Our current implementation is like (1) with some optimizations for 
performance. My proposed approach is similar to (2) with more strict conditions 
by checking diversity only for nodes which are surely connected via their 
neighbours. 
   I earlier tried to implement (2) just by keeping the pruned connections till 
max-conn. As I mentioned, this increased the disconnected nodes by a factor 
instead of decreasing it. The reason I understood was that now there was more 
competition between nodes as every node now had max-conn number of connections 
all the time, triggering the `findWorstNonDiverse()` to be called every time 
for every neighbour, and most of the time the just now newly formed edge will 
be non-diverse and will get dropped. In this I think there might be something 
wrong in the implementation of `checked` and `unchecked` checking that we are 
doing in `findWorstNonDiverse()`. I am not able to get my head around that 
checked-unchecked code completely, but I think it give green signal to all the 
current neighbours as diverse incorrectly and checking just the 'unchecked' 
connections triggers the diversity check only for the newly added edge which it 
founds to be non-diverse and removes it.
   
   I think there is huge difference between the number of connections between 
heuristic (1) and (2), and by inference between the current implementation and 
my proposed implementation, so I will post numbers around that. But I don't 
think that is necessarily bad, if the number of avg connections are increasing 
it is also increasing the recall with ExactKNN, meaning that we can get same 
recall for a smaller max-conn value now. I will also try to get some number 
around same recall with current implementation with lower max-conn and then 
check the search latency. This may also reduce the indexing time, lets see by 
how much.
   
   > @nitirajrathore could you add something to 
[KnnGraphTester](https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java)
 that is a test for connectedness?
   
   I have separate scripts for those, I will post the links here and will try 
to get the code in the luceneutil, although compilation is a issue in that 
package ( which we are trying to fix separately ). We can think of adding that 
as optional check in the KnnGraphTester as well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to