Thanks for contributing back your findings, @Gautam.

Best,
--
Iain Wright

On Wed, Oct 26, 2016 at 2:02 PM, Gautam <[email protected]> wrote:

> Figured out what was causing the bottleneck. The following parameters
> turn out to be very important for scheduling in large clusters or
> clusters with beefy nodes.
>
> These properties in yarn-site.xml helped job throughput:
> - yarn.scheduler.fair.continuous-scheduling-enabled = true : spins off
>   a thread dedicated to assigning containers to app attempts.
> - yarn.scheduler.fair.assignmultiple = true : allows multiple
>   containers to be assigned on each scheduling attempt.
>
> This speeds up scheduler performance considerably and, more
> importantly, reduces uncertainty and noise in scheduling frequency.
> Surprisingly, these didn't show up in any Hadoop presentations, docs,
> or the usual blogs, so hopefully this is useful for someone else.
>
> -Gautam.
>
> On Tue, Oct 25, 2016 at 8:09 PM, Gautam <[email protected]> wrote:
>
>> Hello Mighty Hadoop Users,
>> We've been running into applications (MR/Tez) getting bottlenecked
>> now and then. Apps get stuck in the ACCEPTED state and take a random
>> amount of time to reach RUNNING. Our cluster is not particularly at
>> peak load capacity-wise, but this might be related to sudden bursts
>> of application submissions.
>>
>> The scenario I'm concerned about and trying to fix/optimize:
>> - Applications start piling up in the ACCEPTED state. An app gets
>>   submitted and transitions from SUBMITTED to ACCEPTED, then remains
>>   there for 5, 10, or even 30 minutes in many cases, doing nothing.
>> - The app's queue has available capacity during this time.
>> - There is no user limit configured. We use the fair scheduler, so I
>>   don't think a default user limit is applied. *Please correct me if
>>   I'm wrong.*
>> - The app suddenly gets into RUNNING and finishes as usual.
>>
>> We use Hadoop 2.6.0 (cdh5.7.4); most of the relevant configuration is
>> at its defaults. These are all MapReduce and Tez jobs. I tried
>> increasing yarn.resourcemanager.scheduler.client.thread-count=100
>> and yarn.resourcemanager.amlauncher.thread-count=100, but that didn't
>> help.
>>
>> I have attached the RM debug log (filtered to an app that was stuck
>> for 11 minutes) and the NM log for that app's AM. I would like to
>> know what tuning can help with this.
>>
>> Much appreciated,
>> -Gautam.
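For reference, the two fair-scheduler properties recommended in the thread would be set in yarn-site.xml roughly like this (a minimal sketch using the standard Hadoop configuration format; only the property names and values come from the thread, the surrounding file layout is assumed):

```xml
<!-- yarn-site.xml fragment: fair-scheduler tuning discussed above -->
<configuration>
  <!-- Run a dedicated thread that continuously attempts container
       assignment, instead of assigning only on node heartbeats -->
  <property>
    <name>yarn.scheduler.fair.continuous-scheduling-enabled</name>
    <value>true</value>
  </property>
  <!-- Allow more than one container to be assigned to a node per
       scheduling attempt -->
  <property>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>
</configuration>
```

These settings take effect on the ResourceManager, so it needs a restart after the change.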
