Hi Madhav,

The behaviour sounds normal to me. With a 128 MB block size, a ~3 GB input file yields roughly 24 mappers (i.e., containers used). You cannot use the entire cluster, because the blocks may live on only a subset of the nodes, and map tasks are scheduled where the data is.
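As a rough sanity check (a sketch, assuming the ~3 GB file size reported below and the default 128 MB HDFS block size), the expected mapper count is simply the number of blocks:

```shell
# One mapper per HDFS block under the default FileInputFormat behaviour.
# Assumes ~3 GB input and a 128 MB block size, as mentioned in this thread.
file_size=$((3 * 1024 * 1024 * 1024))    # ~3 GB input file, in bytes
block_size=$((128 * 1024 * 1024))        # 128 MB HDFS block size, in bytes
mappers=$(( (file_size + block_size - 1) / block_size ))  # ceiling division
echo "$mappers"    # prints 24
```

With only ~24 map tasks and 8 vcores per node, three fully-loaded nodes can absorb the whole job, which matches the utilization table below.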
You should not try to use the entire cluster's resources, for the following reason: the time spent initializing a container versus the time spent processing its share of the data should be balanced to maximize container utilization. That is why the 128 MB block size was chosen; in many cases the InputSplit size is increased beyond this to improve container utilization, depending on the workload.

Best,
Mahesh.B.

On Tue, Aug 9, 2016 at 12:19 AM, Madhav Sharan <[email protected]> wrote:

> Hi Hadoop users,
>
> I am running a m/r job with an input file of 23 million records. I can see
> all our nodes are not getting used.
>
> What can I change to utilize all nodes?
>
> Containers  Mem Used  Mem Avail  Vcores Used  Vcores Avail
> 8           11.25 GB  0 B        8            0
> 0           0 B       11.25 GB   0            8
> 0           0 B       11.25 GB   0            8
> 8           11.25 GB  0 B        8            0
> 8           11.25 GB  0 B        8            0
> 7           11.25 GB  0 B        7            1
> 5           7.03 GB   4.22 GB    5            3
> 0           0 B       11.25 GB   0            8
> 0           0 B       11.25 GB   0            8
>
> My command looks like -
>
> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
> /user/pts/output/MeanChiSquareAndSimilarityInput
> /user/pts/output/MeanChiSquaredCalcOutput
>
> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
> input file of 23 million records. The file size is ~3 GB.
>
> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>
> --
> Madhav Sharan
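For reference, the InputSplit size mentioned above can be raised per job without code changes, provided the driver honors generic options (ToolRunner/GenericOptionsParser). This is only a sketch: the 256 MB value is an illustrative choice, and if the driver does not parse generic options, the same property must be set in the Job configuration instead.

```shell
# Hypothetical example: force splits of at least 256 MB for this job only,
# halving the mapper count. Assumes the driver uses ToolRunner; the -D flag
# must appear before the application's own arguments.
hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
  gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation \
  -D mapreduce.input.fileinputformat.split.minsize=268435456 \
  /user/pts/output/MeanChiSquareAndSimilarityInput \
  /user/pts/output/MeanChiSquaredCalcOutput
```

Conversely, lowering `mapreduce.input.fileinputformat.split.maxsize` would create more, smaller splits, which spreads the work over more containers at the cost of more per-container startup overhead.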
