Thanks a lot! It is now working better! Such a small parameter; I didn't know it existed, and it is not commonly modified.
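For anyone reading this thread later: below is a toy simulation (my own sketch, not YARN's actual scheduler code; the node names are made up, though 31 containers per node matches the cluster described further down) of why capping the per-heartbeat assignments spreads containers more evenly across nodes:

```python
# Toy model of per-heartbeat container assignment. Each node heartbeats in
# round-robin order; the scheduler hands out up to `max_assign` pending
# containers per heartbeat (-1 means unlimited, as with the default value of
# yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments).

def simulate(pending, node_capacities, max_assign):
    """Return {node: containers assigned} after round-robin heartbeats."""
    assigned = {node: 0 for node in node_capacities}
    while pending > 0 and any(assigned[n] < node_capacities[n] for n in assigned):
        for node, cap in node_capacities.items():
            room = cap - assigned[node]
            batch = room if max_assign == -1 else min(room, max_assign)
            take = min(batch, pending)
            assigned[node] += take
            pending -= take
            if pending == 0:
                break
    return assigned

caps = {"node1": 31, "node2": 31, "node3": 31}
# Unlimited: the first nodes to heartbeat are filled up (bin packing).
print(simulate(87, caps, -1))  # {'node1': 31, 'node2': 31, 'node3': 25}
# Capped at 2 per heartbeat: the load is spread nearly evenly.
print(simulate(87, caps, 2))   # {'node1': 30, 'node2': 29, 'node3': 28}
```

With the default unlimited setting, whichever NodeManager heartbeats first absorbs as many containers as it can hold; a small cap forces the pending containers to be distributed over heartbeats from all nodes.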
Or

On Thu, Jan 10, 2019 at 16:31 Hariharan <[email protected]> wrote:

> Not an expert on the capacity scheduler, but the above two are not
> queue-level configurations, so I think the changes would not be picked up
> by running refreshQueues. You would need to restart the RM for the new
> values to take effect.
>
> Thanks,
> Hari
>
> On Thu, Jan 10, 2019 at 7:41 PM Or Raz <[email protected]> wrote:
>
>> I have googled more about it, and it seems that two parameters define
>> this "bin packing" behavior. According to
>> https://hadoop.apache.org/docs/r2.9.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Other_Properties
>> yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled
>> is set to true by default, and with
>> yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments
>> set to -1 it can assign all the containers the NodeManager "said" it is
>> capable of running (which could explain the bin packing behavior for the
>> first NodeManager that answers with a heartbeat message).
>> Following Apache's instructions, I have inserted the following into my
>> *capacity-scheduler.xml* in the hadoop/etc/hadoop folder:
>>
>> <property>
>>   <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
>>   <value>true</value>
>>   <description>
>>     Whether to allow multiple container assignments in one NodeManager
>>     heartbeat. Defaults to true.
>>   </description>
>> </property>
>> <property>
>>   <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
>>   <value>2</value>
>>   <description>
>>     If multiple-assignments-enabled is true, the maximum number of
>>     containers that can be assigned in one NodeManager heartbeat.
>>     Defaults to -1, which sets no limit.
>>   </description>
>> </property>
>>
>> I have checked the configuration file, and I am using the capacity
>> scheduler (I have enabled
>> yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled
>> again just to be sure).
>> Furthermore, after running "yarn rmadmin -refreshQueues" I haven't seen
>> any change in the allocation of the Mappers or the Reducers.
>>
>> hadoop2@master:~$ yarn rmadmin -refreshQueues
>> 19/01/10 16:06:33 INFO client.RMProxy: Connecting to ResourceManager at
>> master/172.31.24.83:8033
>>
>> What am I missing here?
>>
>> Or
>>
>> On Wed, Jan 9, 2019 at 23:57 Or Raz <[email protected]> wrote:
>>
>>> Thanks for the tips!
>>> Since I haven't set any scheduler for YARN (on purpose), I am using the
>>> default one (Capacity).
>>> I have looked in yarn-site.xml and in the configuration tab (in the
>>> JobHistory UI), and neither of the parameters you mentioned was there
>>> (so they haven't been set).
>>> You said that I should look at "locality settings"; can you be more
>>> specific about what to look for and where?
>>> Also, it is worth mentioning that I am using three computers and the
>>> replication factor (of HDFS) is three too. Thus, all the data (even the
>>> input) is on every computer, and the memory of each computer is the
>>> same (two t2.xlarge and one m4.xlarge), while I am using
>>> DefaultResourceCalculator.
>>>
>>> Or
>>>
>>> On Wed, Jan 9, 2019 at 23:28 Aaron Eng <[email protected]> wrote:
>>>
>>>> The settings are very relevant to having an equal number of containers
>>>> running on each node if you have an idle cluster and want to
>>>> distribute containers for a single job. An ApplicationMaster submits
>>>> requests for container allocations to the ResourceManager. The
>>>> MRAppMaster will request all the map containers at once, and the
>>>> FairScheduler will find NodeManagers with capacity to fulfill the
>>>> container requests. If assignmultiple is enabled, then you generally
>>>> won't get an even assignment of containers (+/- 1 container) per node.
>>>> Before you say it's not relevant, you should check whether your
>>>> environment uses the FairScheduler and whether multiple assignment is
>>>> enabled.
>>>> If so, that's likely why there isn't an even assignment +/- 1
>>>> container. If you are not using the FairScheduler and/or multiple
>>>> assign, then you should look at locality settings, which can cause
>>>> containers to be preferentially run on a subset of nodes, resulting in
>>>> an uneven container assignment per node.
>>>>
>>>> On Wed, Jan 9, 2019 at 2:19 PM Or Raz <[email protected]> wrote:
>>>>
>>>>> As far as I know, the scheduler in YARN only schedules the jobs and
>>>>> not the containers inside each job. Therefore, I don't believe it is
>>>>> relevant.
>>>>> Also, I haven't used or set those two parameters, and I haven't
>>>>> picked or set any particular scheduler for my research (Fair, FIFO or
>>>>> Capacity). Please correct me if I am wrong.
>>>>> P.S. Currently I have no interest in the situation where I run a few
>>>>> jobs concurrently; my case is much simpler, with one job whose
>>>>> allocation of containers I would like to be more balanced...
>>>>>
>>>>> Or
>>>>>
>>>>> On Wed, Jan 9, 2019 at 19:11 Aaron Eng <[email protected]> wrote:
>>>>>
>>>>>> Have you checked the yarn.scheduler.fair.assignmultiple and
>>>>>> yarn.scheduler.fair.max.assign parameters in the ResourceManager
>>>>>> configuration?
>>>>>>
>>>>>> On Wed, Jan 9, 2019 at 9:49 AM Or Raz <[email protected]> wrote:
>>>>>>
>>>>>>> How can I change/suggest a different allocation of containers to
>>>>>>> tasks in Hadoop? This concerns a native Hadoop (2.9.1) cluster on
>>>>>>> AWS.
>>>>>>>
>>>>>>> I am running a native Hadoop cluster (2.9.1) on AWS (with EC2, not
>>>>>>> EMR), and I want the scheduling/allocation of the containers
>>>>>>> (Mappers/Reducers) to be more balanced than it currently is. It
>>>>>>> seems like the RM is assigning the Mappers in a bin packing way
>>>>>>> (where the data resides), while for the Reducers it looks more
>>>>>>> balanced.
>>>>>>> My setup includes three machines with replication factor three
>>>>>>> (all the data is on every machine), and I run my jobs with
>>>>>>> mapreduce.job.reduce.slowstart.completedmaps=0 to start the
>>>>>>> shuffle as fast as possible (it is vital for me that all the
>>>>>>> containers work concurrently; it is a must condition). Also,
>>>>>>> according to the EC2 instances I have chosen and my YARN cluster
>>>>>>> settings, I can run at most 93 containers (31 on each machine).
>>>>>>>
>>>>>>> For example, if I want nine Reducers, then 93 - 9 - 1 = 83
>>>>>>> containers are left for the Mappers, and one is for the AM. I
>>>>>>> have played with the input split size
>>>>>>> (mapreduce.input.fileinputformat.split.minsize,
>>>>>>> mapreduce.input.fileinputformat.split.maxsize) to find the right
>>>>>>> balance where all of the machines have the same "work" for the
>>>>>>> map phase. But it seems like the first 31 Mappers are allocated
>>>>>>> on one machine, the next 31 on the second one and the last 31 on
>>>>>>> the last machine. Thus, I can try to use 87 Mappers, with 31 of
>>>>>>> them on Machine #1, another 31 on Machine #2 and the remaining 25
>>>>>>> on Machine #3; the rest is left for the Reducers, and as Machine
>>>>>>> #1 and Machine #2 are fully occupied, the Reducers have to be
>>>>>>> placed on Machine #3. This way I get an almost balanced
>>>>>>> allocation of Mappers at the expense of an unbalanced Reducer
>>>>>>> allocation. And this is not what I want...
>>>>>>>
>>>>>>> # of mappers = input_size / split_size [Bytes]
>>>>>>>
>>>>>>> split_size = max(mapreduce.input.fileinputformat.split.minsize,
>>>>>>>     min(mapreduce.input.fileinputformat.split.maxsize,
>>>>>>>     dfs.blocksize))
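The two formulas quoted above can be sanity-checked with a short script. The 128 MB block size and the 10 GB input below are illustrative values, not taken from the thread; the min/max defaults are likewise only approximations of Hadoop's effective defaults.

```python
# Sketch of the split-size and mapper-count formulas from the question.

def split_size(min_size, max_size, block_size):
    # split size = max(minsize, min(maxsize, dfs.blocksize))
    return max(min_size, min(max_size, block_size))

def num_mappers(input_size, split):
    # number of mappers ~= input size / split size, rounded up
    return -(-input_size // split)  # ceiling division

MB = 1024 * 1024
block = 128 * MB              # dfs.blocksize (illustrative)
input_bytes = 10 * 1024 * MB  # 10 GB of input (illustrative)

# With a tiny minsize and an effectively unlimited maxsize, the split size
# collapses to the block size.
s = split_size(1, 2**63 - 1, block)
print(s // MB, num_mappers(input_bytes, s))   # 128 80

# Raising minsize above the block size yields fewer, larger splits.
s = split_size(256 * MB, 2**63 - 1, block)
print(s // MB, num_mappers(input_bytes, s))   # 256 40
```

This is why tuning minsize/maxsize only changes how many Mappers there are, not where the RM places them; the placement imbalance has to be addressed on the scheduler side.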
