Your disks seem to be the issue, and that is what is causing the JournalNode timeouts.
Run benchmarks on the disks backing the NameNodes, ZooKeeper, and the JournalNodes (QJM).
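As a quick first check, here is a minimal fsync-latency sketch (not an official tool; the path argument is a placeholder, point it at the volume that holds each JournalNode's edits directory and the NameNode metadata directory). Each edit-log roll forces a sync to stable storage, so if individual syncs routinely take hundreds of ms or more, that would explain the quorum timeouts. fio or dd with sync flags can give similar numbers.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class FsyncBench {
    public static void main(String[] args) throws IOException {
        // Placeholder path: run against the disk that backs the JN edits dir.
        File target = new File(args.length > 0 ? args[0] : "/tmp/fsync-test.bin");
        byte[] buf = new byte[4096]; // one small write per sync, like an edit batch
        try (FileOutputStream out = new FileOutputStream(target)) {
            for (int i = 0; i < 100; i++) {
                long start = System.nanoTime();
                out.write(buf);
                out.getFD().sync(); // force the write to disk before timing stops
                long micros = (System.nanoTime() - start) / 1000;
                System.out.println("sync " + i + ": " + micros + " us");
            }
        }
        target.delete();
    }
}

On a healthy local disk each sync should complete in a few milliseconds; compare the results across all three JournalNode hosts, since a single slow disk is enough to break the quorum.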
On 3/12/18 2:08 pm, 白 瑶瑶 wrote:
------------------------------------------------------------------------
*From:* 白 瑶瑶 on behalf of 白 瑶瑶 <[email protected]>
*Sent:* October 18, 2018, 10:33
*Subject:* namenode question consultation and advice
Hi:
My production Hadoop cluster (HA) has recently had a problem where both
namenodes hang frequently, with errors that I have not been able to
resolve. The namenode in the active state crashes with the error below,
and the namenode in the standby state then fails to take over. The
error is as follows:
2018-10-18 15:51:36,311 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log
from 10.117.29.24
2018-10-18 15:51:36,311 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2018-10-18 15:51:36,311 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment
3420935
2018-10-18 15:51:38,738 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of
transactions: 19 Total time for transactions(ms): 2 Number of
transactions batched in Syncs: 0 Number of syncs: 10 SyncTimes(ms):
180 2525
2018-10-18 15:51:38,765 INFO
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing
edits file
/data/hadoop/tmp/dfs/name/current/edits_inprogress_0000000000003420935
->
/data/hadoop/tmp/dfs/name/current/edits_0000000000003420935-0000000000003420953
2018-10-18 15:51:38,765 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment
at 3420954
2018-10-18 15:51:44,767 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
6001 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:45,768 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
7002 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:46,769 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
8003 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:47,770 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
9004 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:48,771 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
10005 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:49,771 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
11006 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:50,773 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
12007 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:51,774 INFO
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
13008 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:52,774 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
14009 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:53,776 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
15010 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:54,777 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
16011 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:55,778 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
17013 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:56,780 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
18014 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:57,781 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited
19015 ms (timeout=20000 ms) for a response for
startLogSegment(3420954). Succeeded so far: [10.117.29.25:8485]
2018-10-18 15:51:58,767 FATAL
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log
segment 3420954 failed for required journal (JournalAndStream(mgr=QJM
to [10.117.29.25:8485, 10.117.29.24:8485, 10.117.29.23:8485],
stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes
to respond.
at
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
at
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
at
org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
at
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at
org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1237)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1206)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1300)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5836)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
at
org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
at
org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
2018-10-18 15:51:58,768 INFO org.apache.hadoop.util.ExitUtil: Exiting
with status 1
2018-10-18 15:51:58,773 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at kvmserver25/10.117.29.25
************************************************************/
2018-10-18 16:04:13,143 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = kvmserver25/10.117.29.25
I want to ask: under what circumstances does this error occur, and what
suggestions do you have?
Thank you.
BAI