[jira] [Comment Edited] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

Ajay Kumar (JIRA) Mon, 26 Mar 2018 15:02:46 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414630#comment-16414630 ]


Ajay Kumar edited comment on HADOOP-15317 at 3/26/18 10:01 PM:
---------------------------------------------------------------

[~xiaochen], thanks for updating the patch. The new change that handles the 
best case increases the probability of the initial nodes being chosen, which 
causes sporadic failures in the new test cases. 

{code}
Test failure
java.lang.AssertionError: excludedNodes: [5.5.5.5:9866, 2.2.2.2:9866, 3.3.3.3:9866]
result:{19.19.19.19:9866=0, 10.10.10.10:9866=0, 17.17.17.17:9866=0, 12.12.12.12:9866=0, 9.9.9.9:9866=0,
11.11.11.11:9866=0, 6.6.6.6:9866=0, 1.1.1.1:9866=100, 20.20.20.20:9866=0, 4.4.4.4:9866=0, 5.5.5.5:9866=0,
2.2.2.2:9866=0, 8.8.8.8:9866=0, 14.14.14.14:9866=0, 3.3.3.3:9866=0, 7.7.7.7:9866=0, 13.13.13.13:9866=0,
18.18.18.18:9866=0, 15.15.15.15:9866=0, 16.16.16.16:9866=0}

        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.apache.hadoop.net.TestNetworkTopology.verifyResults(TestNetworkTopology.java:523)
        at org.apache.hadoop.net.TestNetworkTopology.testChooseRandomInclude1(TestNetworkTopology.java:494)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}
This is the result of bounding the random index to the available nodes with 
{{r.nextInt(availableNodes);}}, since availableNodes <= 
parentNode.getNumOfLeaves(). One way to avoid this is to call nextInt twice, 
as suggested in my last comment.
For debugging purposes, it would be good to include {{excludedNodes}} in the 
assert conditions ({{TestNetworkTopology#verifyResults}}, L521/523).
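
To illustrate the kind of skew described above, here is a minimal sketch. It is not the patch, and it is not necessarily the nextInt-twice approach from the earlier comment, which is not reproduced in this message. The class and method names are hypothetical, and it assumes the leaves are held in a plain list of distinct names with every excluded entry present in that list.

```java
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in NetworkTopology.
final class ChooseRandomSketch {

  /**
   * Skewed pick: the index is bounded by the number of available leaves but
   * is applied to the full leaf list, so the last
   * (leaves.size() - availableNodes) leaves can never be returned.
   * Returns null when the picked leaf is excluded (a caller would retry).
   */
  static String pickBounded(List<String> leaves, Set<String> excluded, Random r) {
    int availableNodes = leaves.size() - excluded.size();
    if (availableNodes <= 0) {
      return null;
    }
    String leaf = leaves.get(r.nextInt(availableNodes));
    return excluded.contains(leaf) ? null : leaf;
  }

  /**
   * Uniform pick: draw k in [0, availableNodes) and return the k-th leaf
   * that is not excluded. Single pass over the leaves, no retry loop.
   */
  static String pickKthAvailable(List<String> leaves, Set<String> excluded, Random r) {
    int availableNodes = leaves.size() - excluded.size();
    if (availableNodes <= 0) {
      return null;
    }
    int k = r.nextInt(availableNodes);
    for (String leaf : leaves) {
      if (excluded.contains(leaf)) {
        continue; // skip excluded leaves without consuming k
      }
      if (k-- == 0) {
        return leaf;
      }
    }
    return null; // not reached when the leaves are distinct
  }
}
```

With five leaves and the first two excluded, pickBounded can only ever return the third leaf, while pickKthAvailable spreads its picks over all three remaining leaves. The single-pass variant costs O(number of leaves) per pick, which is the price for avoiding both the skew and an open-ended retry loop.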


> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
>                 Key: HADOOP-15317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
>         Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN became unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}


