Hi Yizhou,

Yes, this might be causing the failovers. I've seen situations where the download of a large fsimage from the SBNN, combined with additional requests to the ANN, led to higher disk latency. That in turn made any service RPC request that requires the HDFS write lock take longer to process. This can cause a failover if all service RPC handlers stay busy for longer than the FC's 45-second timeout, because the FC health-check request sits in the queue for that whole time.
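If it helps with the jstack check suggested in this message, here is a rough sketch of a script that summarizes the handler threads in a dump. It assumes the handler threads are named like "IPC Server handler N on PORT" and that you pass in your own service RPC port; the function names and the 8022 port in the usage line are only illustrative, not anything taken from your cluster.

```python
import re
import sys
from collections import Counter

# Assumed thread naming: "IPC Server handler N on PORT"
# (some versions print "on default port PORT").
HEADER = re.compile(r'^"IPC Server handler (\d+) on (?:default port )?(\d+)"')
STATE = re.compile(r'java\.lang\.Thread\.State: (\w+)')
LOCK = re.compile(r'- (?:waiting to lock|parking to wait for)\s+<(0x[0-9a-f]+)>')


def summarize_handlers(dump_text, port):
    """Count thread states and awaited locks for handler threads on `port`."""
    states, locks = Counter(), Counter()
    in_handler = False
    seen_lock = False
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A new thread entry starts; keep it only if it is a handler on `port`.
            m = HEADER.match(line)
            in_handler = bool(m) and int(m.group(2)) == port
            seen_lock = False
            continue
        if not in_handler:
            continue
        m = STATE.search(line)
        if m:
            states[m.group(1)] += 1
            continue
        m = LOCK.search(line)
        if m and not seen_lock:
            # Record only the first lock each thread is waiting for.
            locks[m.group(1)] += 1
            seen_lock = True
    return states, locks


def looks_like_lock_pileup(states, locks):
    """True if at most one handler is RUNNABLE and the rest wait on one lock."""
    total = sum(states.values())
    if total < 2 or not locks:
        return False
    _, waiters = locks.most_common(1)[0]
    return states.get("RUNNABLE", 0) <= 1 and waiters >= total - 1


def main(argv):
    if len(argv) < 3:
        print("usage: handler_summary.py SERVICE_RPC_PORT JSTACK_FILE...")
        return
    port = int(argv[1])
    for path in argv[2:]:
        with open(path) as f:
            states, locks = summarize_handlers(f.read(), port)
        print(path, dict(states), dict(locks),
              "pile-up" if looks_like_lock_pileup(states, locks) else "ok")


if __name__ == "__main__":
    main(sys.argv)
```

You would run it as something like `python handler_summary.py 8022 jstack-*.txt` over the dumps taken around the failover. It is only a heuristic: a "pile-up" result in several consecutive dumps would be consistent with the pattern described here, but it is worth reading the actual stacks before drawing conclusions.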
You may be able to confirm this by collecting jstacks of the ANN (you would need a few jstacks covering the failover period). The pattern to look for is that all but one of the service RPC handler threads are waiting on the same lock, while only one is runnable.

You might also want to check the dmesg output for blocked-process messages. If there are none, change hung_task_timeout_secs to 40 secs until the next failover, so that you can catch a potential OS pause causing the failover. Such messages may be an indication of file system cache flushes, as described here:

https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

Regards,
Wellington.

> On 26 Apr 2017, at 23:41, Anu Engineer <[email protected]> wrote:
>
> 1.ANN(active namenode) downloading fsimage.ckpt_* from SNN(standby namenode)
> leads to very high disk io, at the same time, zkfc fails to monitor the
> health of ann due to timeout. Is there any releationship between high disk io
> and zkfc monitor request timeout? Every failover happened when ckpt download,
> but not every ckpt download leads to failover.
