Hi Madhav,

The behaviour sounds normal to me. With a 128 MB block size, a ~3 GB input file yields roughly 24 mappers (i.e., containers used). You cannot use the entire cluster, because the blocks may live on only a subset of the nodes, and map tasks are scheduled where the data is.
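As a rough sanity check (a sketch, assuming the ~3 GB file size reported below and the default 128 MB HDFS block size), the expected mapper count is simply the number of blocks:

```shell
# One mapper per HDFS block under the default FileInputFormat behaviour.
# Assumes ~3 GB input and a 128 MB block size, as mentioned in this thread.
file_size=$((3 * 1024 * 1024 * 1024))    # ~3 GB input file, in bytes
block_size=$((128 * 1024 * 1024))        # 128 MB HDFS block size, in bytes
mappers=$(( (file_size + block_size - 1) / block_size ))  # ceiling division
echo "$mappers"    # prints 24
```

With only ~24 map tasks and 8 vcores per node, three fully-loaded nodes can absorb the whole job, which matches the utilization table below.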
You should not try to use the entire cluster's resources, for the following reason: the time spent initializing a container versus the time spent processing its share of the data should be balanced to maximize container utilization. That is why the 128 MB block size was chosen; in many cases the InputSplit size is increased beyond this to improve container utilization, depending on the workload.

Best,
Mahesh.B.

On Tue, Aug 9, 2016 at 12:19 AM, Madhav Sharan <[email protected]> wrote:

> Hi Hadoop users,
>
> I am running a m/r job with an input file of 23 million records. I can see
> all our nodes are not getting used.
>
> What can I change to utilize all nodes?
>
> Containers  Mem Used  Mem Avail  Vcores Used  Vcores Avail
> 8           11.25 GB  0 B        8            0
> 0           0 B       11.25 GB   0            8
> 0           0 B       11.25 GB   0            8
> 8           11.25 GB  0 B        8            0
> 8           11.25 GB  0 B        8            0
> 7           11.25 GB  0 B        7            1
> 5           7.03 GB   4.22 GB    5            3
> 0           0 B       11.25 GB   0            8
> 0           0 B       11.25 GB   0            8
>
> My command looks like -
>
> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
> /user/pts/output/MeanChiSquareAndSimilarityInput
> /user/pts/output/MeanChiSquaredCalcOutput
>
> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
> input file of 23 million records. The file size is ~3 GB.
>
> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>
> --
> Madhav Sharan
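For reference, the InputSplit size mentioned above can be raised per job without code changes, provided the driver honors generic options (ToolRunner/GenericOptionsParser). This is only a sketch: the 256 MB value is an illustrative choice, and if the driver does not parse generic options, the same property must be set in the Job configuration instead.

```shell
# Hypothetical example: force splits of at least 256 MB for this job only,
# halving the mapper count. Assumes the driver uses ToolRunner; the -D flag
# must appear before the application's own arguments.
hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
  gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation \
  -D mapreduce.input.fileinputformat.split.minsize=268435456 \
  /user/pts/output/MeanChiSquareAndSimilarityInput \
  /user/pts/output/MeanChiSquaredCalcOutput
```

Conversely, lowering `mapreduce.input.fileinputformat.split.maxsize` would create more, smaller splits, which spreads the work over more containers at the cost of more per-container startup overhead.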
