Are you checking the ZKFC process logs and jstack? At what stage is the ZKFC timing out? Is the ZK session timing out, or is the NameNode health monitoring timing out?
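
For reference, here is a rough sketch of the settings that correspond to those two timeouts and to the service RPC port. As far as I know, when a service RPC address is configured the ZKFC health checks go to that port instead of queueing behind client calls on 8020. The nameservice name `mycluster`, the NameNode ID `nn1`, the hostname, port 8022, and every value below are placeholders of mine, not recommendations. Please check them against the 2.7.7 defaults before changing anything.

```xml
<!-- hdfs-site.xml: a separate RPC endpoint for DataNode and ZKFC traffic. -->
<!-- "mycluster", "nn1", the hostname and the port are hypothetical. -->
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8022</value>
</property>
<property>
  <!-- Handler threads for the service RPC port; 50 is only an illustrative value. -->
  <name>dfs.namenode.service.handler.count</name>
  <value>50</value>
</property>

<!-- core-site.xml: the two ZKFC timeouts asked about above. -->
<property>
  <!-- Timeout for the ZKFC health-monitor RPC to the NameNode (illustrative value). -->
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>45000</value>
</property>
<property>
  <!-- ZooKeeper session timeout used by the ZKFC (illustrative value). -->
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>10000</value>
</property>
```

Which of the two timeouts shows up in the ZKFC log should tell you which of these is actually relevant.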
On Thu, Sep 19, 2019 at 9:17 AM Wenqi Ma <[email protected]> wrote:

> HDFS version is 2.7.7
>
> We have 500+ nodes, 230 million files and directories, 270 million blocks,
> 128GB memory for namenode. Recently namenode became unstable, and failed
> over 5-10 times every day.
>
> According to the jstack, I cannot find any stuck thread. It seems that the
> namenode just cannot handle the requests in time because RUNNABLE threads
> are changed every time I print the jstack. It is like:
>
> "IPC Server handler 74 on 8020" daemon prio=10 tid=0x00007f5cf4f31000 nid=0x44c5 runnable [0x00007f3ab2fed000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor$BlockIterator.next(DatanodeDescriptor.java:542)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocksWithLocations(BlockManager.java:1069)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocks(BlockManager.java:1044)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlocks(NameNodeRpcServer.java:481)
>         at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.getBlocks(NamenodeProtocolServerSideTranslatorPB.java:86)
>         at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12017)
>
> We have 200 rpc handlers and do not use service-rpc. Is it helpful to
> enable the service-rpc? or any other suggestions?
> Do let me know if you need other information.
> Many thanks.
> --
> Best Regards!
> Wenqi
