On Fri, May 6, 2016 at 2:43 AM, Terence Yim <[email protected]> wrote:

> Hi Keith,
>
> That seems really strange. Does it always fail to acquire new containers
> after a fixed number of container restarts, or is it kind of random? I have
> a cluster whose containers have restarted over 200 times (because the
> runnable runs close to its memory limit and occasionally gets killed by the
> NM), and it is still running fine.
>

I saw this issue while running a test on 20 EC2 nodes.  On the next test I
run, I will watch for this issue and see if I can learn anything new.


>
> Also, have you tried using another scheduler (e.g. the Fair scheduler) to
> see if you get the same result?
>

No, I have not tried other schedulers.

One thing we are doing is enabling the YARN Linux container support.  It's
possible that may be causing this.


> Terence
>
>
> > On May 4, 2016, at 11:57 AM, Keith Turner <[email protected]> wrote:
> >
> > On Wed, May 4, 2016 at 2:14 PM, Terence Yim <[email protected]> wrote:
> >
> >> Hi Keith,
> >>
> >> What is the Hadoop version you are using? Judging from the log, it
> >> could be a bug in the Capacity scheduler [1].
> >>
> >
> > I am using Hadoop 2.6.3.  So that bug should be fixed.
> >
> >
> >> Also, have you looked at the node manager log of the node
> >> "worker14:40196"?
> >>
> >
> > No, I had not; that's a good idea.  I grepped that log for the YARN app id
> > 1462212200762_0008 and saw nothing pertinent.  I also looked around the
> > time of the error message in the RM and saw nothing pertinent.
> >
> >
> >>
> >> [1] https://issues.apache.org/jira/browse/YARN-2628
> >>
> >> Terence
> >>
> >> On Wed, May 4, 2016 at 8:44 AM, Keith Turner <[email protected]> wrote:
> >>
> >>> I ran into an issue where YARN does not seem to be starting containers
> >>> again for an application after some containers died.  The details of
> >>> the issue I am running into are outlined in fluo#657 [1].
> >>>
> >>> Twill seems to be trying to restart the containers, but it seems YARN
> >>> is not doing it.  Looking at the YARN RM web page, there are enough
> >>> cores and memory available to start the containers, so I am not sure
> >>> why it's not starting them.
> >>>
> >>> Does anyone have any tips for debugging this issue or have a second to
> >>> look at the logs attached to fluo#657?
> >>>
> >>> [1] : https://github.com/fluo-io/fluo/issues/657
> >>>
> >>
>
>
