Hi Keith,

That seems really strange. Does it always fail to acquire new containers after a fixed number of container restarts, or is it random? I have a cluster whose containers have been restarted over 200 times (because the runnable uses close to its memory limit and occasionally gets killed by the NM), and it is still running fine.
Also, have you tried another scheduler (e.g. the Fair scheduler) to see if you get the same result?

Terence

> On May 4, 2016, at 11:57 AM, Keith Turner <[email protected]> wrote:
>
> On Wed, May 4, 2016 at 2:14 PM, Terence Yim <[email protected]> wrote:
>
>> Hi Keith,
>>
>> What is the Hadoop version you are using? Judging from the log, it could be
>> a bug in the Capacity scheduler[1].
>>
>
> I am using Hadoop 2.6.3. So that bug should be fixed.
>
>
>> Also, have you looked at the node manager log of the node "worker14:40196"?
>>
>
> No I had not, that's a good idea. I grepped that log for the yarn app id
> 1462212200762_0008 and saw nothing pertinent. I also looked around the
> time of the error message in the RM and saw nothing pertinent.
>
>
>>
>> [1] https://issues.apache.org/jira/browse/YARN-2628
>>
>> Terence
>>
>> On Wed, May 4, 2016 at 8:44 AM, Keith Turner <[email protected]> wrote:
>>
>>> I ran into an issue where Yarn does not seem to be starting containers
>>> again for an application after some containers died. The details of the
>>> issue I am running into are outlined in fluo#657 [1].
>>>
>>> Twill seems to be trying to restart the containers, but it seems YARN is
>>> not doing it. Looking at the YARN RM web page there are enough cores and
>>> memory available to start the containers, so I am not sure why it's not
>>> starting them.
>>>
>>> Does anyone have any tips for debugging this issue or have a second to
>>> look at the logs attached to fluo#657?
>>>
>>> [1] : https://github.com/fluo-io/fluo/issues/657
>>>
>>
