Figured out what was causing the bottleneck. It turns out the following
parameters are very important for scheduling in large clusters or clusters
with beefy nodes.

The following properties in yarn-site.xml improved job throughput:
- yarn.scheduler.fair.continuous-scheduling-enabled = true : Spins off a
thread dedicated to assigning containers to app attempts.
- yarn.scheduler.fair.assignmultiple = true : Allows multiple containers to
be assigned on each scheduling attempt.
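
For anyone who wants to try this, here is a rough sketch of how the two
settings above would look in yarn-site.xml. It assumes the Fair Scheduler is
already the configured scheduler (as it is on our cluster) and that the
properties go inside the existing <configuration> element. As far as I know
the ResourceManager needs a restart to pick them up, so test on a
non-critical cluster first.

```xml
<!-- yarn-site.xml fragment (sketch): place inside <configuration> -->

<!-- Run a dedicated scheduling thread that assigns containers to app attempts -->
<property>
  <name>yarn.scheduler.fair.continuous-scheduling-enabled</name>
  <value>true</value>
</property>

<!-- Allow multiple containers to be assigned on each scheduling attempt -->
<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <value>true</value>
</property>
```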

These settings speed up scheduler performance considerably and, more
importantly, reduce uncertainty and noise in scheduling frequency.
Surprisingly, they didn't show up in any Hadoop presentations, docs, or the
usual blogs, so hopefully this is useful for someone else.

-Gautam.



On Tue, Oct 25, 2016 at 8:09 PM Gautam <[email protected]> wrote:

> Hello Mighty Hadoop Users,
> We've been running into applications (MR/Tez) getting bottlenecked now and
> then. Apps get stuck in the ACCEPTED state and take random amounts of time
> to reach RUNNING. Our cluster is not particularly at peak load,
> capacity-wise, but the problem might be related to sudden bursts of
> application submissions.
>
> Scenario that I'm concerned about and trying to fix/optimize:
>  - Applications start piling up in the ACCEPTED state. An app gets
> submitted and transitions from SUBMITTED to ACCEPTED. It remains there for
> 5, 10, or even 30 minutes in many cases, doing nothing.
>  - The app's queue has available capacity during this time.
>  - There is no user limit configured. We use the fair-share scheduler, so I
> don't think a default user limit is applied. *Please correct me if I'm
> wrong.*
>  - The app suddenly gets into RUNNING and finishes as usual.
>
> We use Hadoop 2.6.0 (cdh5.7.4); most of the relevant configurations are at
> their defaults. These are all MapReduce and Tez jobs. I tried increasing
> yarn.resourcemanager.scheduler.client.thread-count=100 and
> yarn.resourcemanager.amlauncher.thread-count=100, but that didn't help.
>
> I have attached the RM debug log (filtered by app that was stuck for 11
> mins) and NM log for the AM of that app. Would like to know what tuning can
> help with this.
>
> Much Appreciated,
> -Gautam.
>
