[
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582806#comment-14582806
]
lvchuanwen commented on HADOOP-10251:
-------------------------------------
######################################hdfs-nn2-zkfc-host196.log
file#######################################
2015-06-12 02:25:54,146 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2015-06-12 02:25:54,147 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists: 0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e
2015-06-12 02:25:54,149 INFO org.apache.hadoop.ha.ZKFailoverController: Should
fence: NameNode at zdh195/10.43.156.195:9000
2015-06-12 02:25:55,151 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:55,152 WARN org.apache.hadoop.ha.FailoverController: Unable to
gracefully make NameNode at zdh195/10.43.156.195:9000 standby (unable to
connect)
java.net.ConnectException: Call From zdh196/10.43.156.196 to zdh195:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor24.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
at
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
at
org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:516)
at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:507)
at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:894)
at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
Caused by: java.net.ConnectException: 拒绝连接
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 14 more
2015-06-12 02:25:55,153 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning
Service Fencing Process... ======
2015-06-12 02:25:55,154 INFO org.apache.hadoop.ha.NodeFencer: Trying method
1/2: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-06-12 02:25:55,157 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting
to zdh195...
2015-06-12 02:25:55,157 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connecting to zdh195 port 22
2015-06-12 02:25:55,159 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connection established
2015-06-12 02:25:55,188 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Remote version string: SSH-2.0-OpenSSH_5.3
2015-06-12 02:25:55,188 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local
version string: SSH-2.0-JSCH-0.1.42
2015-06-12 02:25:55,189 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
CheckCiphers:
aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-ctr is not available.
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-ctr is not available.
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-cbc is not available.
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-cbc is not available.
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
arcfour256 is not available.
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT sent
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT received
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex:
server->client aes128-ctr hmac-md5 none
2015-06-12 02:25:55,194 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex:
client->server aes128-ctr hmac-md5 none
2015-06-12 02:25:55,197 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXDH_INIT sent
2015-06-12 02:25:55,197 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
expecting SSH_MSG_KEXDH_REPLY
2015-06-12 02:25:55,205 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
ssh_rsa_verify: signature true
2015-06-12 02:25:55,205 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Permanently added 'zdh195' (RSA) to the list of known hosts.
2015-06-12 02:25:55,205 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS sent
2015-06-12 02:25:55,205 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS received
2015-06-12 02:25:55,206 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_REQUEST sent
2015-06-12 02:25:55,207 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_ACCEPT received
2015-06-12 02:25:55,209 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue:
gssapi-with-mic,publickey,keyboard-interactive,password
2015-06-12 02:25:55,209 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next
authentication method: gssapi-with-mic
2015-06-12 02:25:55,215 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue: publickey,keyboard-interactive,password
2015-06-12 02:25:55,215 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next
authentication method: publickey
2015-06-12 02:25:55,263 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentication succeeded (publickey).
2015-06-12 02:25:55,263 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected
to zdh195
2015-06-12 02:25:55,263 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking
for process running on port 9000
2015-06-12 02:25:57,418 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Indeterminate response from trying to kill service. Verifying whether it is
running using nc...
2015-06-12 02:25:59,266 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Verified
that the service is down.
2015-06-12 02:25:59,266 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from zdh195 port 22
2015-06-12 02:25:59,267 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing
successful by method org.apache.hadoop.ha.SshFenceByTcpPort(null) ======
2015-06-12 02:25:59,267 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing
znode /hadoop-ha/cluster195/ActiveBreadCrumb to indicate that the local node is
the most recent active...
2015-06-12 02:25:59,267 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Caught an exception, leaving main loop due to Socket closed
2015-06-12 02:25:59,271 INFO org.apache.hadoop.ha.ZKFailoverController: Trying
to make NameNode at zdh196/10.43.156.196:9000 active...
2015-06-12 02:26:00,596 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at zdh196/10.43.156.196:9000 to active state
2015-06-12 02:26:02,608 ERROR org.apache.hadoop.ha.ZKFailoverController: Local
service NameNode at zdh196/10.43.156.196:9000 has changed the serviceState to
active. Expected was standby. Quitting election marking fencing necessary.
2015-06-12 02:26:02,608 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2015-06-12 02:26:02,614 INFO org.apache.zookeeper.ZooKeeper: Session:
0x24d91acb5cb1c4f closed
2015-06-12 02:26:02,615 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c4f
2015-06-12 02:26:02,615 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2015-06-12 02:26:03,619 INFO org.apache.zookeeper.ZooKeeper: Initiating client
connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181
sessionTimeout=10000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@33f89c4e
2015-06-12 02:26:03,622 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server zdh195/10.43.156.195:2181. Will not attempt to
authenticate using SASL (unknown error)
2015-06-12 02:26:03,623 INFO org.apache.zookeeper.ClientCnxn: Socket connection
established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:26:03,626 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server zdh195/10.43.156.195:2181, sessionid =
0x24d91acb5cb1c50, negotiated timeout = 10000
2015-06-12 02:26:03,627 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session
connected.
2015-06-12 02:26:03,629 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2015-06-12 02:26:03,631 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists: 0a0a636c757374657231393512036e6e321a067a646831393620a84628d33e
2015-06-12 02:26:03,631 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old
node has our own data, so don't need to fence it.
2015-06-12 02:26:03,631 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing
znode /hadoop-ha/cluster195/ActiveBreadCrumb to indicate that the local node is
the most recent active...
2015-06-12 02:26:03,635 INFO org.apache.hadoop.ha.ZKFailoverController: Trying
to make NameNode at zdh196/10.43.156.196:9000 active...
2015-06-12 02:26:03,638 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at zdh196/10.43.156.196:9000 to active state
2015-06-12 02:28:03,996 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh196/10.43.156.196:9000: Failed on local exception: java.io.EOFException;
Host Details : local host is: "zdh196/10.43.156.196"; destination host is:
"zdh196":9000;
2015-06-12 02:28:03,997 INFO org.apache.hadoop.ha.HealthMonitor: Entering state
SERVICE_NOT_RESPONDING
2015-06-12 02:28:03,997 INFO org.apache.hadoop.ha.ZKFailoverController: Local
service NameNode at zdh196/10.43.156.196:9000 entered state:
SERVICE_NOT_RESPONDING
2015-06-12 02:28:03,997 INFO org.apache.hadoop.ha.ZKFailoverController:
Quitting master election for NameNode at zdh196/10.43.156.196:9000 and marking
that fencing is necessary
2015-06-12 02:28:03,997 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2015-06-12 02:28:04,000 INFO org.apache.zookeeper.ZooKeeper: Session:
0x24d91acb5cb1c50 closed
2015-06-12 02:28:04,001 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c50
2015-06-12 02:28:04,001 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2015-06-12 02:28:06,002 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh196/10.43.156.196:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:28:06,004 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh196/10.43.156.196:9000: Call From zdh196/10.43.156.196 to zdh196:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
...
...
...
2015-06-12 02:30:26,197 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh196/10.43.156.196:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:30:26,198 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh196/10.43.156.196:9000: Call From zdh196/10.43.156.196 to zdh196:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:30:28,199 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh196/10.43.156.196:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:30:28,200 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh196/10.43.156.196:9000: Call From zdh196/10.43.156.196 to zdh196:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
> Both NameNodes could be in STANDBY State if SNN network is unstable
> -------------------------------------------------------------------
>
> Key: HADOOP-10251
> URL: https://issues.apache.org/jira/browse/HADOOP-10251
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.2.0
> Reporter: Vinayakumar B
> Assignee: Vinayakumar B
> Priority: Critical
> Fix For: 2.5.0
>
> Attachments: HADOOP-10251.patch, HADOOP-10251.patch,
> HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch
>
>
> Following corner scenario happened in one of our cluster.
> 1. NN1 was Active and NN2 was Standby
> 2. NN2 machine's network was slow
> 3. NN1 got shutdown.
> 4. NN2 ZKFC got the notification and trying to check for old active for
> fencing. (This took little more time, again due to slow network)
> 5. In between, NN1 got restarted by our automatic monitoring, and ZKFC made
> it Active.
> 6. Now NN2 ZKFC got Old Active as NN1 and it did graceful fencing of NN1 to
> STANBY.
> 7. Before writing ActiveBreadCrumb to ZK, NN2 ZKFC got session timeout and
> got shutdown before making NN2 Active.
> *Now cluster having both NameNodes as STANDBY.*
> NN1 ZKFC still thinks that its nameNode is in Active state.
> NN2 ZKFC waiting for election.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)