Harsh,
I am assuming you mean the web interface of the JobTracker, right? What I
see there is appended at the end of this email. Is there supposed to be a
counter equal to the number of data-local map tasks? One obvious way to
find this would be to look at the location of the input split of each of the
mappers and see whether it matches the node the map task actually ran on.
Do I need to enable some config parameter to actually see the counter that
shows the number of data-local tasks?
Thanks
Virajith
==================================================================================
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      1600        0         0         1600       0        3 / 46
reduce   100.00%      20          0         0         20         0        0 / 1
Counter                                 Map              Reduce           Total
Job Counters
  Launched reduce tasks                 0                0                21
  Rack-local map tasks                  0                0                1,649
  Launched map tasks                    0                0                1,649
FileSystemCounters
  FILE_BYTES_READ                       215,256,891,609  494,340,016,724  709,596,908,333
  HDFS_BYTES_READ                       215,481,828,554  0                215,481,828,554
  FILE_BYTES_WRITTEN                    430,057,823,630  494,340,016,724  924,397,840,354
  HDFS_BYTES_WRITTEN                    0                215,457,161,571  215,457,161,571
Map-Reduce Framework
  Reduce input groups                   0                20,369,713       20,369,713
  Combine output records                0                0                0
  Map input records                     20,443,005       0                20,443,005
  Reduce shuffle bytes                  0                214,894,166,095  214,894,166,095
  Reduce output records                 0                20,443,005       20,443,005
  Spilled Records                       40,886,010       46,997,605       87,883,615
  Map output bytes                      214,913,316,171  0                214,913,316,171
  Map input bytes                       215,457,082,591  0                215,457,082,591
  Map output records                    20,443,005       0                20,443,005
  Combine input records                 0                0                0
  Reduce input records                  0                20,443,005       20,443,005
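(An aside on the counter itself: in 0.20.x, if I remember right, a counter only
shows up once it has been incremented at least once, so the absence of a
"Data-local map tasks" row above would itself suggest zero data-local maps.
Below is a minimal, cluster-free sketch of the split-location check I described;
the host names are made up for illustration, and it assumes the 0.20.2 behavior
that InputSplit.getLocations() gives the replica hosts of a split:)

```java
import java.util.Arrays;

// Sketch only: checks whether a map task ran on one of the hosts that
// store its input split's block, mirroring the manual check described
// above. Host names below are invented for illustration.
public class LocalityCheck {
    // A map attempt is data-local if the node it ran on is among the
    // replica locations of its input split.
    static boolean isDataLocal(String taskHost, String[] splitHosts) {
        return Arrays.asList(splitHosts).contains(taskHost);
    }

    public static void main(String[] args) {
        // With replication factor 1 (as in this job) each split has exactly
        // one location, so data-locality means the task host equals it.
        String[] splitHosts = {"node07"};  // would come from InputSplit.getLocations()
        System.out.println(isDataLocal("node07", splitHosts)); // true: local read
        System.out.println(isDataLocal("node12", splitHosts)); // false: remote read
    }
}
```

(In a real check, the task's host would come from the task attempt page in the
JobTracker UI, and the split locations from the job's InputFormat.)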
On Tue, Jul 12, 2011 at 2:43 PM, Harsh J <[email protected]> wrote:
> Virajith,
>
> You can see the counters for data-local vs. non-data-local map tasks in
> the job itself.
>
> On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti
> <[email protected]> wrote:
> > How do I find the number of data-local map tasks that are launched? I
> > checked the log files but didn't see any information about this. All the
> > map tasks are rack local since I am running the job using just a single
> > rack. From the completion time per map (comparing it to the case where I
> > have 1Gbps of bandwidth between the nodes, i.e. the case where network
> > bandwidth is not a bottleneck), I saw that more than 90% of the maps are
> > actually reading data over the network.
> >
> > I understand that there might be some maps that are launched as
> > non-data-local tasks, but I am surprised that around 90% of the maps are
> > actually running as non-data-local tasks.
> >
> > I have not measured how much bandwidth was being used, but I think the
> > whole 50Mbps is being used.
> >
> > Thanks,
> > Virajith
> >
> >
> > On Tue, Jul 12, 2011 at 1:55 PM, Harsh J <[email protected]> wrote:
> >>
> >> How much bandwidth did you see being utilized? What was the count of
> >> tasks launched as data-local map tasks versus rack-local ones?
> >>
> >> A little bit of edge record data is always read over network but that
> >> is highly insignificant compared to the amount of data read locally (a
> >> whole block size, if available).
> >>
> >> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
> >> <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
> >> > input data using a 20-node cluster. HDFS is configured to use a 128MB
> >> > block size (so 1600 maps are created) and a replication factor of 1
> >> > is being used. All 20 nodes are also HDFS datanodes. I was using a
> >> > bandwidth value of 50Mbps between each of the nodes (this was
> >> > configured using linux "tc"). I see that around 90% of the map tasks
> >> > are reading data over the network, i.e. most of the map tasks are not
> >> > being scheduled at the nodes where the data to be processed by them
> >> > is located.
> >> > My understanding was that Hadoop tries to schedule as many data-local
> >> > maps as possible. But in this situation, this does not seem to
> >> > happen. Any reason why this is happening? And is there a way to
> >> > configure Hadoop to ensure the maximum possible node locality?
> >> > Any help regarding this is very much appreciated.
> >> >
> >> > Thanks,
> >> > Virajith
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
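
P.S. Harsh's point above about edge records can be put in rough numbers: for
line-oriented input, a data-local map's only remote bytes are the tail record
spilling into the next block, so the remote fraction is about one record over
one block. A back-of-envelope sketch, using the 128MB block size from this
thread and an assumed 200-byte record (the record size is my guess, purely for
illustration):

```java
public class EdgeRecordBytes {
    // Worst-case fraction of a map's input read remotely when its only
    // non-local bytes are the tail record spilling into the next block.
    static double remoteFraction(long blockBytes, long recordBytes) {
        return (double) recordBytes / blockBytes;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024; // 128MB block size, as in this job
        long record = 200;               // assumed record size, illustrative only
        System.out.printf("remote fraction ~ %.2e%n", remoteFraction(block, record));
    }
}
```

That fraction is on the order of one in a million, which is why edge records
cannot explain 90% of maps reading whole blocks over the network.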