Have you checked the ZKFC process logs and jstack output?
At what stage is ZKFC timing out: is the ZK session timing out, or is the
NameNode health monitoring timing out?


On Thu, Sep 19, 2019 at 9:17 AM Wenqi Ma <[email protected]> wrote:

> HDFS version is 2.7.7
>
> We have 500+ nodes, 230 million files and directories, 270 million blocks,
> and 128GB of memory for the namenode. Recently the namenode became unstable
> and has been failing over 5-10 times every day.
>
> According to the jstack output, I cannot find any stuck thread. It seems that
> the namenode just cannot handle the requests in time, because the set of
> RUNNABLE threads is different every time I take a jstack dump. A typical
> thread looks like this:
> "IPC Server handler 74 on 8020" daemon prio=10 tid=0x00007f5cf4f31000 nid=0x44c5 runnable [0x00007f3ab2fed000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor$BlockIterator.next(DatanodeDescriptor.java:542)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocksWithLocations(BlockManager.java:1069)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlocks(BlockManager.java:1044)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlocks(NameNodeRpcServer.java:481)
>     at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.getBlocks(NamenodeProtocolServerSideTranslatorPB.java:86)
>     at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12017)
>
> We have 200 RPC handlers and do not use the service RPC. Would it help to
> enable the service RPC, or do you have any other suggestions?
> Do let me know if you need other information.
> Many thanks.
> --
> Best Regards!
> Wenqi
>
>
