Thank you, Ravi, for your reply. I found a parameter, 'yarn.resourcemanager.nm.liveness-monitor.interval-ms' (default value = 1000 ms), in yarn-default.xml (v2.4.1), which determines how often to check that node managers are still alive. So the RM checks the NMs' heartbeats every second, but it takes 10 minutes to decide that an NM is dead (yarn.nm.liveness-monitor.expiry-interval-ms: how long to wait until a node manager is considered dead; default value = 600000 ms).
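To make the interplay of the two settings concrete, here is a minimal Python sketch of a periodic liveness check. It is a simplified model under my own assumptions, not the actual ResourceManager code: the function names `is_expired` and `detection_time_ms` are hypothetical, and the checks are assumed to fire at exact multiples of the check interval.

```python
# Simplified model of a liveness monitor; NOT the actual ResourceManager code.
# The two constants mirror the defaults quoted from yarn-default.xml (v2.4.1).
CHECK_INTERVAL_MS = 1_000      # yarn.resourcemanager.nm.liveness-monitor.interval-ms
EXPIRY_INTERVAL_MS = 600_000   # yarn.nm.liveness-monitor.expiry-interval-ms


def is_expired(last_heartbeat_ms, now_ms, expiry_ms=EXPIRY_INTERVAL_MS):
    """A node counts as dead only once its last heartbeat is older than expiry_ms."""
    return now_ms - last_heartbeat_ms > expiry_ms


def detection_time_ms(last_heartbeat_ms, check_ms=CHECK_INTERVAL_MS,
                      expiry_ms=EXPIRY_INTERVAL_MS):
    """Return the time of the first periodic check that sees the node as expired.

    Checks are assumed (hypothetically) to run at 0, check_ms, 2 * check_ms, ...
    """
    now_ms = 0
    while not is_expired(last_heartbeat_ms, now_ms, expiry_ms):
        now_ms += check_ms
    return now_ms


if __name__ == "__main__":
    # Last heartbeat arrived 500 ms into the run; the node then went silent.
    print(detection_time_ms(last_heartbeat_ms=500))
```

In this model, the one-second check interval only bounds how late the expiry is noticed; the 10-minute expiry interval dominates the detection delay. Until that point the node is simply "not yet expired" as far as the monitor is concerned.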
What happens if the RM finds that an NM's heartbeat is missing but 10 minutes have not yet passed (that is, yarn.nm.liveness-monitor.expiry-interval-ms has not expired)? Will a new application still make container requests to that NM via the RM?

Thanks,
Tanvir

On Wed, Nov 2, 2016 at 5:41 PM, Ravi Prakash <[email protected]> wrote:

> Hi Tanvir!
>
> It's hard to have some configuration that works for all cluster scenarios.
> I suspect that value was chosen as somewhat of a mirror of the time it takes
> HDFS to realize a datanode is dead (which is also 10 mins from what I
> remember). The RM also has to reschedule the work when that timeout
> expires. Also, there may be network glitches which could last that
> long. Also, the NMs are pretty stable by themselves. Failing NMs have
> not been too common in my experience.
>
> HTH
> Ravi
>
> On Wed, Nov 2, 2016 at 10:44 AM, Tanvir Rahman <[email protected]>
> wrote:
>
>> Hello,
>> Can anyone please tell me why the default value of
>> 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in
>> yarn-default.xml
>> <https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>
>> is so high? This parameter determines "How often to check that containers
>> are still alive". The default value is 600000 ms, or 10 minutes. So if a
>> node manager fails, the resource manager detects the dead container after
>> 10 minutes.
>>
>> I am running a wordcount job on my university cluster. In the middle of the
>> run, I stopped the node manager on one node (the data node was still running)
>> and found that the completion time increased by about 10 minutes because of
>> the node manager failure.
>>
>> Thanks in advance
>> Tanvir
