Hi All,
We run HDFS HA (QJM-based) with 5 JournalNodes, Apache Hadoop 2.5.0 on Red Hat
6.5 with JDK 1.7.
After loading more than 1 PB of data into HDFS (the fsimage is about 10 GB) and
continuing to send requests to the cluster, the NameNodes fail over frequently.
I would like to ask about the following:
1. When the ANN (active NameNode) downloads fsimage.ckpt_* from the SNN (standby
NameNode), disk I/O becomes very high. At the same time, ZKFC fails to monitor
the health of the ANN because of a timeout. Is there any relationship between
the high disk I/O and the ZKFC health-monitor timeout? Every failover happened
during a checkpoint download, but not every checkpoint download led to a
failover.
2017-03-15 09:27:05,750 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at nn1/ip:8020:
Call From nn1/ip to nn1:8020 failed on socket timeout exception:
java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/ip:48536 remote=nn1/ip:8020] For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
2017-03-15 09:27:05,750 INFO org.apache.hadoop.ha.HealthMonitor: Entering state
SERVICE_NOT_RESPONDING
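The 45000 ms in the log above matches the default value of
ha.health-monitor.rpc-timeout.ms, so these are the two settings I am thinking of
changing. This is only a sketch: the values are my own guesses, not tested
recommendations, and I have not confirmed that either one addresses the root
cause.

```xml
<!-- core-site.xml: give the ZKFC health check more time (default is 45000).
     The value 90000 is an assumed example, not a tested recommendation. -->
<property>
<name>ha.health-monitor.rpc-timeout.ms</name>
<value>90000</value>
</property>
<!-- hdfs-site.xml: throttle fsimage transfer, in bytes per second.
     0 (the default) means no throttling; 52428800 (50 MB/s) is an assumed example. -->
<property>
<name>dfs.image.transfer.bandwidthPerSec</name>
<value>52428800</value>
</property>
```

Raising the timeout would only hide a slow NameNode from ZKFC, so I would prefer
to understand the cause before relying on it.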
2. Because of SERVICE_NOT_RESPONDING, the other ZKFC fences the old ANN (sshfence
is configured). Before my additional monitor restarts it, the old ANN's log
sometimes shows the following. What is "Rescan of postponedMisreplicatedBlocks"?
Is it related to the failover?
2017-03-15 04:36:00,866 INFO
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
2017-03-15 04:36:00,931 INFO
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned
0 directive(s) and 0 block(s) in 65 millisecond(s).
2017-03-15 04:36:01,127 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 23 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:04,145 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 17 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:07,159 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 14 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:10,173 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 14 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:13,188 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 14 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:16,211 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 23 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:19,234 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of
postponedMisreplicatedBlocks completed in 22 msecs. 247361 blocks are left. 0
blocks are removed.
2017-03-15 04:36:28,994 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
STARTUP_MSG:
3. I configured two dfs.namenode.name.dir directories and one
dfs.journalnode.edits.dir (which shares a disk with the NameNode). Is this
layout suitable, or does it have any disadvantages?
<property>
<name>dfs.namenode.name.dir.nameservice.nn1</name>
<value>/data1/hdfs/dfs/name,/data2/hdfs/dfs/name</value>
</property>
<property>
<name>dfs.namenode.name.dir.nameservice.nn2</name>
<value>/data1/hdfs/dfs/name,/data2/hdfs/dfs/name</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data1/hdfs/dfs/journal</value>
</property>
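If sharing a disk turns out to be a problem, one alternative I could try is
moving the JournalNode edits to a disk of their own. A sketch, assuming a third
disk mounted at /data3 (hypothetical; this mount does not exist in my current
setup):

```xml
<!-- Hypothetical layout: /data3 is an assumed dedicated disk for JournalNode
     edits, so that edit log writes do not compete with fsimage I/O. -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data3/hdfs/dfs/journal</value>
</property>
```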
4. I am interested in the design of checkpointing and edit log transfer. Are
there any explanations, issues, or documents I could read?
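As far as I understand, how often a checkpoint is triggered is controlled by the
two settings below. The values shown are the documented defaults as I understand
them, not what I have verified on my cluster, so please correct me if I have
misread them:

```xml
<!-- hdfs-site.xml: checkpoint at least once per period (seconds) ... -->
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
</property>
<!-- ... or sooner, once this many uncheckpointed transactions accumulate. -->
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
</property>
```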
Thanks in advance,
Doris