nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1801982741

   I was able to conduct some perf-test as well. I would like to propose 
following changes
   With example of adding node `a` to 'b, c, d'
   
   | node | neighbours|
   |---|---|
   | b | c, d, e|
   | c | b, d|
   | e | e, x, y, z |
   
   ### Proposed heuristic 
   1. At the time of adding a node always add all (upto maxConn) edges to 
highest scoring neighbours (disregarding the diversity check) eg. adding edges 
a-->b, a-->c etc. Currently we check for diverse connections at this time also.
   2. At the time of adding reverse edges from neighbours to the node, if the 
given neighbour has more than maxConn number of connections then use new 
diversity check formula to calculate the edge that can be removed. For example 
at the time of adding edge b --> a if we see that `b` has more than maxConn 
connections then apply new hiuristic and try to remove one edge.
   3. Check for diversity only between the common neighbours of candidate node 
and the node. Example, if `b` has more than maxConn connections then we iterate 
over b's neighbours (which now includes 'a' as well) considering each one as a 
possible candidate for disconnection. When deciding if we should disconnect an 
edge say b-->c or not. We will consider the only the common neighbours of c and 
b for diversity check, so we will compare the score(c,b) with only score(c,d) 
and not with score(c,e) as only `d` is the common neighbour of c and b. If we 
find a non diverse node we return that, otherwise
   4. we return a node which has highest number of common neighbours with `b`, 
otherwise
   5. we return a node with maximum number of edges (eg. `e` in above case).
   6. If none of the above condition exists, then we return -1, meaning we 
should not remove any connection.
   7. If we receive a valid node for removal, we remove that and the other half 
of the connection i.e. we remove both bi-directional connections. so if `c` is 
returned as non-diverse node, then we remove both b---> c and c ---> b 
connections.
   
   Performance number:
   |Branch| recall | avgCpuTime | numDocs | reindexTimeMsec | fanout | maxConn 
| beamWidth | totalVisited | selectivity | prefilter   |
   |---| ------ | ---------- | ------- | ------ | ------- | --------- | 
------------ | --------------- | ----------- | ----------- |
   |Baseline | 0.451  | 17.78      | 1000000 | 370916  | 0      | 16      | 100 
      | 10                   | 1.00        | post-filter |
   |Candidate| 0.599 (33% increase)  | 21.63 (22% increase)     | 1000000 | 
624797 (68% increase)  | 0      | 16      | 100       | 10                   | 
1.00        | post-filter |
   
   Since now we are increasing the number of connections per node, we also see 
a good increase in the recall for same max-conn. For the same reason we do 
expect the avgCpuTime to increase at query time. I can work a little bit on 
improving the indexing time as the my current implementation is very naive and 
finds common nodes by two for loops and some other places recalculates the 
scores where not needed. But overall the indexing time will increase for sure.
   
   Another data that I will post next is the number of (histogram) number of 
edges each node has. Expect it to increase from lower number to max-conn on 
average. 
   
   I will upload my (rough) draft PR now.
   
   #### How I reached this heuristic:
   Before coming up with this heuristic I tried quite some simpler ones, but in 
those cases I was either getting much much higher disconnections than the 
mainline or if very less and even just 1 disconnection, but some disconnection 
was happening. I am open for removing or adding any other condition which is 
considered for edge removal.
   
   Initially, I tried with just (1) and left the rest of the algorithm same as 
in mainline. This "increased" the disconnections by large factor instead of 
decreasing it. The issue was that say 0 to 32 nodes are added with star kind of 
formation, now if 33rd node comes in, it will form connections with 1-32 nodes 
and each may find 0 as the non diverse node, so each one removes 0 leaving it 
disconnected from the graph. Other than this extreme case there were others 
too. 
   That is where (3) came in handy where we don't check diverseness of an edge 
unless we are sure that there is an alternate path to the disconnecting node 
via a neighbouring node. This coupled with (7) reduced the disconnectedness 
from few thousand nodes to just couple of nodes at level-1 and only occassional 
disconnected node at level-0.
   Then to achieve no disconnection at all, I initially tried (1, 2, 3) with 
(6) to remove even slightest chance of creating disconnectedness, but with this 
we end up not removing any connection a large number of times and the number of 
connections per node increased from max-conn of 32 at level-0 to upwards of 250 
for more than 5% of nodes. 
   To control the number of connections I added (4) and (5) mostly to remove at 
least some connection without affecting the graph geometry by much.
   One might wonder if (7) is actually required, so I will run some test 
without (7). But I think it plays an important by freeing up some of the edges 
and allowing those nodes to accept un-conditional connections from other 
incoming node.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to