These settings are very relevant to getting an equal number of containers
running on each node when you have an idle cluster and want to distribute
the containers of a single job.  An application master submits requests for
container allocations to the ResourceManager.  The MRAppMaster requests
all the map containers at once, and the FairScheduler then finds NodeManagers
with capacity to fulfill those requests.  If multiple assignment
(yarn.scheduler.fair.assignmultiple) is enabled, a single NodeManager
heartbeat can be handed several containers at once, so you generally won't
get an even number of containers (within +/- 1) on each node.  Before you
say it's not relevant, you should check whether your environment uses the
FairScheduler and whether multiple assignment is enabled.  If so, that's
likely why the assignment isn't even to within +/- 1 container.  If you are
not using the FairScheduler and/or multiple assignment, then you should look
at the locality settings, which can cause containers to be preferentially
run on a subset of nodes, resulting in an uneven container assignment per
node.
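
To make the effect of multiple assignment concrete, here is a rough Python
sketch.  It is a toy model and not the actual FairScheduler code: it assumes
nodes heartbeat in a fixed round-robin order, that every container is the
same size, and that locality plays no role.  The function name and its
parameters are made up for illustration; only the idea of "one container
per heartbeat" versus "as many as fit per heartbeat" comes from the
discussion above.

```python
def simulate_assignment(pending, node_capacity, num_nodes,
                        assign_multiple, max_assign=-1):
    """Toy model of heartbeat-driven container assignment.

    pending         -- number of container requests waiting (hypothetical input)
    node_capacity   -- containers that fit on one node
    assign_multiple -- mimics the idea behind yarn.scheduler.fair.assignmultiple
    max_assign      -- mimics yarn.scheduler.fair.max.assign; values <= 0 are
                       treated as "no limit" in this toy model
    Returns the number of containers placed on each node.
    """
    assigned = [0] * num_nodes
    # Keep cycling through node heartbeats until requests or capacity run out.
    while pending > 0 and any(a < node_capacity for a in assigned):
        for node in range(num_nodes):
            free = node_capacity - assigned[node]
            if pending == 0 or free == 0:
                continue
            if assign_multiple:
                # Several containers may be handed out on one heartbeat.
                batch = free if max_assign <= 0 else min(free, max_assign)
            else:
                # At most one container per heartbeat.
                batch = 1
            batch = min(batch, pending)
            assigned[node] += batch
            pending -= batch
    return assigned


if __name__ == "__main__":
    # 87 map containers on 3 nodes that each fit 31 containers.
    print(simulate_assignment(87, 31, 3, assign_multiple=True))   # [31, 31, 25]
    print(simulate_assignment(87, 31, 3, assign_multiple=False))  # [29, 29, 29]
```

In this simplified model, unlimited multiple assignment fills the first
nodes completely, while one container per heartbeat spreads the same
requests evenly.  A real cluster will differ because of heartbeat timing
and locality.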

On Wed, Jan 9, 2019 at 2:19 PM Or Raz <[email protected]> wrote:

> As far as I know, the scheduler in YARN only schedules the jobs and
> not the containers inside each job. Therefore, I don't believe it is
> relevant.
> Also, I haven't used or set those two parameters, and I haven't picked or
> set any particular scheduler for my research (Fair, FIFO or Capacity).
> Please correct me if I am wrong.
> P.S. Currently I have no interest in the situation where I run a few jobs
> concurrently; my case is much simpler, with one job for which I would like
> the allocation of containers to be more balanced...
> Or
>
>
> On Wed, Jan 9, 2019 at 19:11, Aaron Eng <[email protected]> wrote:
>
>> Have you checked the yarn.scheduler.fair.assignmultiple
>> and yarn.scheduler.fair.max.assign parameters for the ResourceManager
>> configuration?
>>
>> On Wed, Jan 9, 2019 at 9:49 AM Or Raz <[email protected]> wrote:
>>
>>> How can I change/suggest a different allocation of containers to tasks
>>> in Hadoop? Regarding a native Hadoop (2.9.1) cluster on AWS.
>>>
>>> I am running a native Hadoop cluster (2.9.1) on AWS (with EC2, not EMR),
>>> and I want the scheduling/allocation of the containers (Mappers/Reducers)
>>> to be more balanced than it is currently. It seems like the RM is
>>> assigning the Mappers in a bin-packing way (where the data resides), while
>>> the reducers look more balanced. My setup includes three machines with a
>>> replication factor of three (all the data is on every machine), and I run
>>> my jobs with mapreduce.job.reduce.slowstart.completedmaps=0 to start the
>>> shuffle as early as possible (it is essential for me that all the
>>> containers run concurrently). Also, given the EC2 instances I have chosen
>>> and my YARN cluster settings, I can run at most 93 containers (31 per
>>> machine).
>>>
>>> For example, if I want to have nine reducers, then 93-9-1=83
>>> containers are left for the mappers, with one container for the AM. I
>>> have played with the input split size
>>> (mapreduce.input.fileinputformat.split.minsize,
>>> mapreduce.input.fileinputformat.split.maxsize) to find the right balance
>>> where all of the machines have the same "work" for the map phase. But it
>>> seems like the first 31 mappers are allocated on one machine, the
>>> next 31 on the second one, and the last 31 on the last machine. Thus, I
>>> can try to use 87 mappers, with 31 of them on Machine #1, another 31 on
>>> Machine #2, and another 25 on Machine #3, leaving the rest for the
>>> reducers. Since Machine #1 and Machine #2 are then fully occupied, the
>>> reducers have to be placed on Machine #3. This way I get an almost
>>> balanced allocation of mappers at the expense of an unbalanced allocation
>>> of reducers. And this is not what I want...
>>>
>>> # of mappers = size_input / split size [Bytes]
>>>
>>> split size = max(mapreduce.input.fileinputformat.split.minsize,
>>>                  min(mapreduce.input.fileinputformat.split.maxsize,
>>>                      dfs.blocksize))
>>>
>>
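
For reference, the split-size formula quoted above can be sketched in
Python as follows.  This is only an approximation under stated assumptions:
the function names are hypothetical, the input is treated as one large
file, and the real FileInputFormat computes splits per file and lets the
last split of a file run slightly over the split size, so the actual mapper
count can differ from this estimate.

```python
import math


def split_size(min_size, max_size, block_size):
    """Split size in bytes, following the formula quoted in the thread:
    max(split.minsize, min(split.maxsize, dfs.blocksize))."""
    return max(min_size, min(max_size, block_size))


def estimated_mappers(input_bytes, min_size, max_size, block_size):
    """Rough mapper count: input size divided by split size, rounded up.

    Treats the whole input as a single file (an assumption of this sketch).
    """
    return math.ceil(input_bytes / split_size(min_size, max_size, block_size))


if __name__ == "__main__":
    MIB = 2 ** 20
    GIB = 2 ** 30
    # Example values chosen for illustration only: 10 GiB input, 128 MiB blocks.
    print(estimated_mappers(10 * GIB, 1, 2 ** 63 - 1, 128 * MIB))  # 80
    # Capping split.maxsize at 64 MiB halves the split size.
    print(estimated_mappers(10 * GIB, 1, 64 * MIB, 128 * MIB))     # 160
```

Lowering split.maxsize below the block size, or raising split.minsize above
it, is what moves the mapper count in this model.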
