Harsh,
I am assuming you mean the web interface of the JobTracker, right? What I
see there is appended at the end of this email. Is there supposed to be a
counter equal to the number of data-local map tasks? One obvious way to
find this would be to look at the location of the input split of each of the
mappers and see whether it matches the node the map task actually ran on.
Do I need to enable some config parameter to actually see the counter that
shows the number of data-local tasks?
Thanks
Virajith
==================================================================================
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      1600        0         0         1600       0        3 / 46
reduce   100.00%      20          0         0         20         0        0 / 1
Counter                                 Map              Reduce           Total
Job Counters
  Launched reduce tasks                 0                0                21
  Rack-local map tasks                  0                0                1,649
  Launched map tasks                    0                0                1,649
FileSystemCounters
  FILE_BYTES_READ                       215,256,891,609  494,340,016,724  709,596,908,333
  HDFS_BYTES_READ                       215,481,828,554  0                215,481,828,554
  FILE_BYTES_WRITTEN                    430,057,823,630  494,340,016,724  924,397,840,354
  HDFS_BYTES_WRITTEN                    0                215,457,161,571  215,457,161,571
Map-Reduce Framework
  Reduce input groups                   0                20,369,713       20,369,713
  Combine output records                0                0                0
  Map input records                     20,443,005       0                20,443,005
  Reduce shuffle bytes                  0                214,894,166,095  214,894,166,095
  Reduce output records                 0                20,443,005       20,443,005
  Spilled Records                       40,886,010       46,997,605       87,883,615
  Map output bytes                      214,913,316,171  0                214,913,316,171
  Map input bytes                       215,457,082,591  0                215,457,082,591
  Map output records                    20,443,005       0                20,443,005
  Combine input records                 0                0                0
  Reduce input records                  0                20,443,005       20,443,005
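(An aside on the counter itself: in 0.20.x, if I remember right, a counter only
shows up once it has been incremented at least once, so the absence of a
"Data-local map tasks" row above would itself suggest zero data-local maps.
Below is a minimal, cluster-free sketch of the split-location check I described;
the host names are made up for illustration, and it assumes the 0.20.2 behavior
that InputSplit.getLocations() gives the replica hosts of a split:)

```java
import java.util.Arrays;

// Sketch only: checks whether a map task ran on one of the hosts that
// store its input split's block, mirroring the manual check described
// above. Host names below are invented for illustration.
public class LocalityCheck {
    // A map attempt is data-local if the node it ran on is among the
    // replica locations of its input split.
    static boolean isDataLocal(String taskHost, String[] splitHosts) {
        return Arrays.asList(splitHosts).contains(taskHost);
    }

    public static void main(String[] args) {
        // With replication factor 1 (as in this job) each split has exactly
        // one location, so data-locality means the task host equals it.
        String[] splitHosts = {"node07"};  // would come from InputSplit.getLocations()
        System.out.println(isDataLocal("node07", splitHosts)); // true: local read
        System.out.println(isDataLocal("node12", splitHosts)); // false: remote read
    }
}
```

(In a real check, the task's host would come from the task attempt page in the
JobTracker UI, and the split locations from the job's InputFormat.)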
On Tue, Jul 12, 2011 at 2:43 PM, Harsh J <[email protected]> wrote:
> Virajith,
>
> You can see the counters for data-local vs. non-data-local map tasks in
> the job itself.
>
> On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti
> <[email protected]> wrote:
> > How do I find the number of data-local map tasks that are launched? I
> > checked the log files but didn't see any information about this. All the
> > map tasks are rack local since I am running the job using just a single
> > rack. From the completion time per map (comparing it to the case where I
> > have 1Gbps of bandwidth between the nodes, i.e. the case where network
> > bandwidth is not a bottleneck), I saw that more than 90% of the maps are
> > actually reading data over the network.
> >
> > I understand that there might be some maps that are launched as
> > non-data-local tasks, but I am surprised that around 90% of the maps are
> > actually running as non-data-local tasks.
> >
> > I have not measured how much bandwidth was being used, but I think the
> > whole 50Mbps is being used.
> >
> > Thanks,
> > Virajith
> >
> >
> > On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <[email protected]> wrote:
> >>
> >> How much bandwidth did you see being utilized? What was the count of
> >> tasks launched as data-local map tasks versus rack-local ones?
> >>
> >> A little bit of edge record data is always read over network but that
> >> is highly insignificant compared to the amount of data read locally (a
> >> whole block size, if available).
> >>
> >> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
> >> <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
> >> > input data using a 20-node cluster. HDFS is configured to use a 128MB
> >> > block size (so 1600 maps are created) and a replication factor of 1
> >> > is being used. All 20 nodes are also HDFS datanodes. I was using a
> >> > bandwidth value of 50Mbps between each of the nodes (this was
> >> > configured using linux "tc"). I see that around 90% of the map tasks
> >> > are reading data over the network, i.e. most of the map tasks are not
> >> > being scheduled at the nodes where the data to be processed by them
> >> > is located.
> >> > My understanding was that Hadoop tries to schedule as many data-local
> >> > maps as possible. But in this situation, this does not seem to
> >> > happen. Any reason why this is happening? And is there a way to
> >> > configure Hadoop to ensure the maximum possible node locality?
> >> > Any help regarding this is very much appreciated.
> >> >
> >> > Thanks,
> >> > Virajith
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
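
P.S. Harsh's point above about edge records can be put in rough numbers: for
line-oriented input, a data-local map's only remote bytes are the tail record
spilling into the next block, so the remote fraction is about one record over
one block. A back-of-envelope sketch, using the 128MB block size from this
thread and an assumed 200-byte record (the record size is my guess, purely for
illustration):

```java
public class EdgeRecordBytes {
    // Worst-case fraction of a map's input read remotely when its only
    // non-local bytes are the tail record spilling into the next block.
    static double remoteFraction(long blockBytes, long recordBytes) {
        return (double) recordBytes / blockBytes;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024; // 128MB block size, as in this job
        long record = 200;               // assumed record size, illustrative only
        System.out.printf("remote fraction ~ %.2e%n", remoteFraction(block, record));
    }
}
```

That fraction is on the order of one in a million, which is why edge records
cannot explain 90% of maps reading whole blocks over the network.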