[jira] [Comment Edited] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

Ajay Kumar (JIRA) Mon, 26 Mar 2018 15:00:54 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414630#comment-16414630 ]


Ajay Kumar edited comment on HADOOP-15317 at 3/26/18 9:59 PM:
--------------------------------------------------------------

[~xiaochen], thanks for updating the patch. The new change to handle the best 
case increases the probability of the initial nodes being chosen, which 
results in sporadic failures of the new test cases. 

{code}
Test failure
java.lang.AssertionError: excludedNodes: [5.5.5.5:9866, 2.2.2.2:9866, 3.3.3.3:9866]
result:{19.19.19.19:9866=0, 10.10.10.10:9866=0, 17.17.17.17:9866=0, 12.12.12.12:9866=0, 9.9.9.9:9866=0,
11.11.11.11:9866=0, 6.6.6.6:9866=0, 1.1.1.1:9866=100, 20.20.20.20:9866=0, 4.4.4.4:9866=0, 5.5.5.5:9866=0,
2.2.2.2:9866=0, 8.8.8.8:9866=0, 14.14.14.14:9866=0, 3.3.3.3:9866=0, 7.7.7.7:9866=0, 13.13.13.13:9866=0,
18.18.18.18:9866=0, 15.15.15.15:9866=0, 16.16.16.16:9866=0}

        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.apache.hadoop.net.TestNetworkTopology.verifyResults(TestNetworkTopology.java:523)
        at org.apache.hadoop.net.TestNetworkTopology.testChooseRandomInclude1(TestNetworkTopology.java:494)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}
This is the result of bounding the random index to the available nodes with 
{{r.nextInt(availableNodes);}}, since availableNodes <= 
parentNode.getNumOfLeaves(). One way to avoid this is to call nextInt twice, 
as suggested in my last comment.
For debugging purposes, it would be good to include {{excludedNodes}} in the 
assert conditions ({{L521/523 TestNetworkTopology#verifyResults}}).
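
As a rough illustration of the concern above (this is not the code in the 
patch; the class {{NthValidPicker}}, the method {{pick}}, and the use of plain 
strings for nodes are all hypothetical), the sketch below shows one way a 
value from {{r.nextInt(availableNodes)}} could stay uniform: treat it as a 
count over non-excluded leaves only, rather than as a raw position in the full 
leaf list. It assumes the bias comes from raw positions below availableNodes 
never reaching the trailing leaves, which may not be the actual cause here.

```java
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of NetworkTopology.
final class NthValidPicker {

  /**
   * Returns one leaf that is not in {@code excluded}, each with equal
   * probability, or null when every leaf is excluded.
   */
  static String pick(List<String> leaves, Set<String> excluded, Random r) {
    // Count the leaves that are actually eligible.
    int available = 0;
    for (String leaf : leaves) {
      if (!excluded.contains(leaf)) {
        available++;
      }
    }
    if (available == 0) {
      return null;
    }

    // The random value indexes only the eligible leaves, so every
    // eligible leaf is reachable with probability 1/available.
    int nth = r.nextInt(available);
    for (String leaf : leaves) {
      if (excluded.contains(leaf)) {
        continue; // excluded leaves do not consume the counter
      }
      if (nth == 0) {
        return leaf;
      }
      nth--;
    }
    throw new IllegalStateException("eligible count changed during pick");
  }
}
```

The loop is bounded by the number of leaves, so it cannot spin indefinitely, 
at the cost of a linear scan per call.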


was (Author: ajayydv):
[~xiaochen], thanks for updating the patch. The new change to handle the best 
case increases the probability of the initial nodes being chosen, which 
results in sporadic failures of the new test cases. This is the result of 
bounding the random index to the available nodes with 
{{r.nextInt(availableNodes);}}, since availableNodes <= 
parentNode.getNumOfLeaves(). 

{code}
Test failure
java.lang.AssertionError: excludedNodes: [5.5.5.5:9866, 2.2.2.2:9866, 3.3.3.3:9866]
result:{19.19.19.19:9866=0, 10.10.10.10:9866=0, 17.17.17.17:9866=0, 12.12.12.12:9866=0, 9.9.9.9:9866=0,
11.11.11.11:9866=0, 6.6.6.6:9866=0, 1.1.1.1:9866=100, 20.20.20.20:9866=0, 4.4.4.4:9866=0, 5.5.5.5:9866=0,
2.2.2.2:9866=0, 8.8.8.8:9866=0, 14.14.14.14:9866=0, 3.3.3.3:9866=0, 7.7.7.7:9866=0, 13.13.13.13:9866=0,
18.18.18.18:9866=0, 15.15.15.15:9866=0, 16.16.16.16:9866=0}

        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.apache.hadoop.net.TestNetworkTopology.verifyResults(TestNetworkTopology.java:523)
        at org.apache.hadoop.net.TestNetworkTopology.testChooseRandomInclude1(TestNetworkTopology.java:494)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}

One way to avoid this is to call nextInt twice, as suggested in my last 
comment.
For debugging purposes, it would be good to include {{excludedNodes}} in the 
assert conditions ({{L521/523 TestNetworkTopology#verifyResults}}).

> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
>                 Key: HADOOP-15317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
>         Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch
>
>
> Recently we found a postmortem case where the ANN seems to have been in an 
> infinite loop. From the logs, it seems it had just gone through a rolling 
> restart, and DNs were getting registered.
> Later the NN became unresponsive, and from the stack trace it was inside a 
> do-while loop in {{NetworkTopology#chooseRandom}}, part of what was done 
> in HDFS-10320.
> Going through the code and logs, I'm not able to come up with any theory 
> for why this is happening (I thought about incorrect locking, or the Node 
> object being modified outside of NetworkTopology; both seem impossible), 
> but we should eliminate this loop.
> Stack trace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



