Hello!
How many reducers are you using?
Regarding the performance parameters: first, you can increase the value of the
io.sort.mb parameter.
It seems that you are sending a large amount of data to the reducer. By
increasing the value of this parameter, the framework will be less likely to
write/spill map output to disk during the shuffle phase, which could be one
reason the process is slow.
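As a sketch, you would set it in mapred-site.xml (property name as in the
0.20.x releases; the default is 100 MB, and the right value depends on your
task heap size):

```xml
<property>
  <!-- size in MB of the in-memory buffer used to sort map output;
       default is 100; raising it reduces spills to local disk -->
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```

Make sure the task JVM heap (mapred.child.java.opts) is large enough to hold
the bigger buffer.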
If you are using only one reducer, then all of the map output is sent over
HTTP to that single reducer. That is another thing worth thinking about.
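If that is the case, you can raise the reducer count in mapred-site.xml (or
per job with -D mapred.reduce.tasks=N); the default in 0.20.x is 1, and the
value 8 below is just an illustration for an eight-node cluster:

```xml
<property>
  <!-- number of reduce tasks per job; default is 1 -->
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```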
Out of curiosity, also try increasing dfs.block.size to 128 MB. It seems that
you are using the default of 64 MB; with larger blocks you'll get fewer map
tasks.
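For example, in hdfs-site.xml (the value is in bytes; note this only affects
files written after the change, so you would need to re-upload the input):

```xml
<property>
  <!-- HDFS block size in bytes; default is 67108864 (64 MB) -->
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```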
Also, depending on your machine configuration (in particular how many CPU
cores you have), you can increase the values of
mapred.tasktracker.{map|reduce}.tasks.maximum, the maximum number of
map/reduce tasks run simultaneously on a given TaskTracker. Each defaults
to 2 (2 maps and 2 reduces), but vary it depending on your hardware.
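A sketch for mapred-site.xml on each TaskTracker node, assuming for
illustration a machine with enough cores and memory for 4 of each:

```xml
<property>
  <!-- concurrent map tasks per TaskTracker; default is 2 -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <!-- concurrent reduce tasks per TaskTracker; default is 2 -->
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```

These take effect after restarting the TaskTracker daemons.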
You can have a look at
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html.
A good book for understanding the tuning parameters is Hadoop: The Definitive
Guide by Tom White.
Hope that the above helps.
Regards,
Florin
--- On Thu, 11/3/11, Steve Lewis <[email protected]> wrote:
From: Steve Lewis <[email protected]>
Subject: Problems with MR Job running really slowly
To: "mapreduce-user" <[email protected]>
Date: Thursday, November 3, 2011, 11:07 PM
I have a job which takes an XML file; the splitter breaks the file into tags,
and the mapper parses each tag and sends the data to the reducer. I am using a
custom splitter which reads the file looking for start and end tags.
When I run the splitter and mapper code locally (generating the separate tags
and parsing them), I can read a file of about 500 MB containing 12,000 tags on
my local system in 23 seconds.
When I read the file from HDFS on a local cluster, I can read and parse it in
38 seconds.
When I run the same code on an eight-node cluster I get 7 map tasks. The
mappers are taking 190 seconds to handle 100 tags, of which 200 milliseconds
is parsing and almost all of the rest of the time is in context.write. A
mapper handling 1600 tags takes about 3 hours. These are the statistics for a
map task. It is true that one tag will be sent to about 300 keys, but 3 hours
to write 1.5 million records and 5 GB still seems way excessive:
FileSystemCounters
  FILE_BYTES_READ         816,935,457
  HDFS_BYTES_READ         439,554,860
  FILE_BYTES_WRITTEN      1,667,745,197
Performance
  TotalScoredScans        1,660
Map-Reduce Framework
  Combine output records  0
  Map input records       6,134
  Spilled Records         1,690,063
  Map output bytes        5,517,423,780
  Combine input records   0
  Map output records      571,475
Anyone want to offer suggestions on how to tune the job better?
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com