How can I change or influence the allocation of containers to tasks in a native Hadoop (2.9.1) cluster on AWS?
I am running a native Hadoop 2.9.1 cluster on AWS (on EC2, not EMR), and I want the scheduling/allocation of containers (mappers/reducers) to be more balanced than it currently is. The ResourceManager seems to assign the mappers in a bin-packing fashion (placing them where the data resides), while the allocation of the reducers looks more balanced.

My setup: three machines with a replication factor of three (all the data is on every machine). I run my jobs with mapreduce.job.reduce.slowstart.completedmaps=0 to start the shuffle as early as possible (it is vital for me that all the containers work concurrently; this is a hard requirement). Given the EC2 instance types I have chosen and my YARN settings, I can run at most 93 containers (31 per machine). For example, if I want nine reducers, then 93 - 9 - 1 = 83 containers are left for the mappers, and one is for the ApplicationMaster.

I have played with the split-size settings (mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize) to find the right balance where every machine does the same amount of map work. But it seems the first 31 mappers get allocated on one machine, the next 31 on the second, and the last 31 on the third. So I can, for instance, use 87 mappers: 31 on machine #1, 31 on machine #2, and 25 on machine #3, with the rest left for the reducers. Since machines #1 and #2 are then fully occupied, the reducers all have to be placed on machine #3. This way I get an almost balanced allocation of mappers at the expense of an unbalanced allocation of reducers, and this is not what I want.

The number of mappers is determined by:

    # of mappers = input size / split size [bytes]
    split size = max(mapreduce.input.fileinputformat.split.minsize,
                     min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))
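For reference, this is how I set these properties. They can go in mapred-site.xml or be passed per job with -D; the 128 MB values below are just an illustration of pinning the split size, not the actual numbers from my cluster:

```xml
<!-- mapred-site.xml (or -Dname=value on the job command line) -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0</value> <!-- start the shuffle as soon as the first map finishes -->
</property>
<property>
  <!-- illustrative: forcing min = max pins the split size to 128 MB -->
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>134217728</value>
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value>
</property>
```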
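As a sanity check, here is the formula above as a small sketch; the 10 GB input, 128 MB block size, and default min/max split sizes are made-up example values, not my cluster's settings:

```python
import math

def split_size(min_size, max_size, block_size):
    # split size = max(minsize, min(maxsize, dfs.blocksize))
    return max(min_size, min(max_size, block_size))

def num_mappers(input_size, min_size, max_size, block_size):
    # # of mappers = input size / split size, rounded up to whole splits
    return math.ceil(input_size / split_size(min_size, max_size, block_size))

MB = 1024 * 1024
# With the default minsize (1) and a very large maxsize, the split size
# collapses to dfs.blocksize (128 MB here):
print(split_size(1, 10 * 1024 * MB, 128 * MB))                    # 134217728
# A 10 GB input then yields 80 mappers:
print(num_mappers(10 * 1024 * MB, 1, 10 * 1024 * MB, 128 * MB))   # 80
```

Raising minsize above the block size (or lowering maxsize below it) is the knob that changes the mapper count, which is what I have been tuning.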
