Hi, I'm working with a few clusters of 100+ nodes, and I've been wondering how exactly failover, as well as a cold start, works with respect to block reports.
I sometimes see failover times of 15-45 minutes, spent waiting in safe mode for all blocks to be reported. I believe Datanodes usually send a report only every six hours, so there must be something else going on. How are Datanodes informed of the new Namenode? How do they know that they should send a full block report (assuming this is what happens)? I assume the answer to both lies in heartbeats? Are there any guidelines on how long recovery should take, and are there any options that can be used to decrease that time? Thank you!
