On Fri, May 6, 2016 at 2:43 AM, Terence Yim <[email protected]> wrote:
> Hi Keith,
>
> That seems really strange. Does it always fail to acquire new containers
> after a fixed number of container restarts, or is it random? I’ve a
> cluster that has had containers restarted over 200 times (because the
> runnable uses close to its memory limit and is occasionally killed by
> the NM) and it is still running fine.
>

I saw this issue while running a test on 20 EC2 nodes. On the next test I
run, I will watch for this issue and see if I can learn anything new.

>
> Also, have you tried another scheduler (e.g. the Fair scheduler) to see
> if you get the same result?
>

No, I have not tried other schedulers. One thing we are doing is enabling
the YARN Linux container support. It's possible that may be causing this.

> Terence
>
>
> > On May 4, 2016, at 11:57 AM, Keith Turner <[email protected]> wrote:
> >
> > On Wed, May 4, 2016 at 2:14 PM, Terence Yim <[email protected]> wrote:
> >
> >> Hi Keith,
> >>
> >> What is the Hadoop version you are using? Judging from the log, it
> >> could be a bug in the Capacity scheduler[1].
> >>
> >
> > I am using Hadoop 2.6.3. So that bug should be fixed.
> >
> >
> >> Also, have you looked at the node manager log of the node
> >> "worker14:40196"?
> >>
> >
> > No, I had not; that's a good idea. I grepped that log for the YARN app
> > id 1462212200762_0008 and saw nothing pertinent. I also looked around
> > the time of the error message in the RM and saw nothing pertinent.
> >
> >
> >>
> >> [1] https://issues.apache.org/jira/browse/YARN-2628
> >>
> >> Terence
> >>
> >> On Wed, May 4, 2016 at 8:44 AM, Keith Turner <[email protected]> wrote:
> >>
> >>> I ran into an issue where YARN does not seem to be starting
> >>> containers again for an application after some containers died. The
> >>> details of the issue I am running into are outlined in fluo#657 [1].
> >>>
> >>> Twill seems to be trying to restart the containers, but it seems YARN
> >>> is not doing it. Looking at the YARN RM web page, there are enough
> >>> cores and memory available to start the containers, so I am not sure
> >>> why it's not starting them.
> >>>
> >>> Does anyone have any tips for debugging this issue or have a second
> >>> to look at the logs attached to fluo#657?
> >>>
> >>> [1] : https://github.com/fluo-io/fluo/issues/657
> >>>
> >>
> >
