nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1801982741
I was able to conduct some perf-test as well. I would like to propose following changes With example of adding node `a` to 'b, c, d' | node | neighbours| |---|---| | b | c, d, e| | c | b, d| | e | e, x, y, z | ### Proposed heuristic 1. At the time of adding a node always add all (upto maxConn) edges to highest scoring neighbours (disregarding the diversity check) eg. adding edges a-->b, a-->c etc. Currently we check for diverse connections at this time also. 2. At the time of adding reverse edges from neighbours to the node, if the given neighbour has more than maxConn number of connections then use new diversity check formula to calculate the edge that can be removed. For example at the time of adding edge b --> a if we see that `b` has more than maxConn connections then apply new hiuristic and try to remove one edge. 3. Check for diversity only between the common neighbours of candidate node and the node. Example, if `b` has more than maxConn connections then we iterate over b's neighbours (which now includes 'a' as well) considering each one as a possible candidate for disconnection. When deciding if we should disconnect an edge say b-->c or not. We will consider the only the common neighbours of c and b for diversity check, so we will compare the score(c,b) with only score(c,d) and not with score(c,e) as only `d` is the common neighbour of c and b. If we find a non diverse node we return that, otherwise 4. we return a node which has highest number of common neighbours with `b`, otherwise 5. we return a node with maximum number of edges (eg. `e` in above case). 6. If none of the above condition exists, then we return -1, meaning we should not remove any connection. 7. If we receive a valid node for removal, we remove that and the other half of the connection i.e. we remove both bi-directional connections. so if `c` is returned as non-diverse node, then we remove both b---> c and c ---> b connections. Performance number: |Branch| recall | avgCpuTime | numDocs | reindexTimeMsec | fanout | maxConn | beamWidth | totalVisited | selectivity | prefilter | |---| ------ | ---------- | ------- | ------ | ------- | --------- | ------------ | --------------- | ----------- | ----------- | |Baseline | 0.451 | 17.78 | 1000000 | 370916 | 0 | 16 | 100 | 10 | 1.00 | post-filter | |Candidate| 0.599 (33% increase) | 21.63 (22% increase) | 1000000 | 624797 (68% increase) | 0 | 16 | 100 | 10 | 1.00 | post-filter | Since now we are increasing the number of connections per node, we also see a good increase in the recall for same max-conn. For the same reason we do expect the avgCpuTime to increase at query time. I can work a little bit on improving the indexing time as the my current implementation is very naive and finds common nodes by two for loops and some other places recalculates the scores where not needed. But overall the indexing time will increase for sure. Another data that I will post next is the number of (histogram) number of edges each node has. Expect it to increase from lower number to max-conn on average. I will upload my (rough) draft PR now. #### How I reached this heuristic: Before coming up with this heuristic I tried quite some simpler ones, but in those cases I was either getting much much higher disconnections than the mainline or if very less and even just 1 disconnection, but some disconnection was happening. I am open for removing or adding any other condition which is considered for edge removal. Initially, I tried with just (1) and left the rest of the algorithm same as in mainline. This "increased" the disconnections by large factor instead of decreasing it. The issue was that say 0 to 32 nodes are added with star kind of formation, now if 33rd node comes in, it will form connections with 1-32 nodes and each may find 0 as the non diverse node, so each one removes 0 leaving it disconnected from the graph. Other than this extreme case there were others too. That is where (3) came in handy where we don't check diverseness of an edge unless we are sure that there is an alternate path to the disconnecting node via a neighbouring node. This coupled with (7) reduced the disconnectedness from few thousand nodes to just couple of nodes at level-1 and only occassional disconnected node at level-0. Then to achieve no disconnection at all, I initially tried (1, 2, 3) with (6) to remove even slightest chance of creating disconnectedness, but with this we end up not removing any connection a large number of times and the number of connections per node increased from max-conn of 32 at level-0 to upwards of 250 for more than 5% of nodes. To control the number of connections I added (4) and (5) mostly to remove at least some connection without affecting the graph geometry by much. One might wonder if (7) is actually required, so I will run some test without (7). But I think it plays an important by freeing up some of the edges and allowing those nodes to accept un-conditional connections from other incoming node. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org