How can I change or influence the allocation of containers to tasks in
Hadoop? This concerns a native Hadoop (2.9.1) cluster on AWS.

I am running a native Hadoop (2.9.1) cluster on AWS (on EC2, not EMR), and
I want the scheduling/allocation of containers (Mappers/Reducers) to be
more balanced than it currently is. The RM seems to assign the Mappers in
a bin-packing fashion (where the data resides), while the Reducers look
more balanced. My setup has three machines with a replication factor of
three (all of the data is on every machine), and I run my jobs with
mapreduce.job.reduce.slowstart.completedmaps=0 so that the shuffle starts
as early as possible (it is vital for me that all of the containers run
concurrently; it is a must condition). Given the EC2 instances I have
chosen and my YARN cluster settings, I can run at most 93 containers
(31 per machine).
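To make the 93/31 figures concrete, here is a minimal sketch of how YARN derives the per-node container count from memory settings. The concrete numbers below are assumptions chosen to reproduce the 31-containers-per-node figure, not my actual configuration:

```python
# Sketch: how many containers one NodeManager can host, memory-wise.
# Hypothetical values: ~31 GB usable memory per NM, 1 GB per container.

def containers_per_node(node_memory_mb, container_memory_mb):
    """Containers a single NodeManager can host, given its memory budget."""
    return node_memory_mb // container_memory_mb

per_node = containers_per_node(31744, 1024)   # -> 31
cluster_total = per_node * 3                  # three machines -> 93
print(per_node, cluster_total)
```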

For example, if I want nine Reducers, then 93 - 9 - 1 = 83 containers are
left for the Mappers, and one is for the AM. I have played with the split
input size (mapreduce.input.fileinputformat.split.minsize,
mapreduce.input.fileinputformat.split.maxsize) to find the right balance
where every machine gets the same "work" in the map phase. But it seems
like the first 31 Mappers are allocated on one machine, the next 31 on the
second, and the last 31 on the third. So I can try to use 87 Mappers, with
31 on Machine #1, another 31 on Machine #2, and another 25 on Machine #3;
the rest is left for the Reducers, and since Machine #1 and Machine #2 are
fully occupied, the Reducers have to be placed on Machine #3. This way I
get an almost balanced allocation of Mappers at the expense of an
unbalanced allocation of Reducers. And this is not what I want...
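The arithmetic above, together with the fill-one-node-first placement I observe, can be sketched as follows (the bin-packing function is my model of the observed behavior, not the actual RM logic):

```python
# Sketch of the allocation arithmetic and the bin-packing placement I
# observe: the RM appears to fill one node completely before moving on.

TOTAL, PER_NODE, NODES = 93, 31, 3

def bin_pack(n_mappers):
    """Place mappers node by node, up to PER_NODE each (observed behavior)."""
    placement, remaining = [], n_mappers
    for _ in range(NODES):
        placement.append(min(PER_NODE, remaining))
        remaining -= placement[-1]
    return placement

# Nine Reducers plus one AM leave 83 mapper slots ...
print(TOTAL - 9 - 1)   # 83
# ... and 87 Mappers end up as 31 / 31 / 25 across the three machines.
print(bin_pack(87))    # [31, 31, 25]
```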

# of mappers = input_size / split_size  [bytes]

split_size = max(mapreduce.input.fileinputformat.split.minsize,
                 min(mapreduce.input.fileinputformat.split.maxsize,
                     dfs.blocksize))
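The formula as code, for quick experimentation with split bounds. The input size and the min/max/block values below are illustrative (128 MB block, common defaults), not from my cluster:

```python
import math

# Split size clamps dfs.blocksize between the configured min and max.
def split_size(min_size, max_size, block_size):
    return max(min_size, min(max_size, block_size))

# Each split gets one Mapper; the last split may be partial, hence ceil.
def num_mappers(input_size, split):
    return math.ceil(input_size / split)

MB = 1024 * 1024
size = split_size(1, 256 * MB, 128 * MB)   # -> 128 MB split
print(num_mappers(10 * 1024 * MB, size))   # 10 GB input -> 80 mappers
```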
