It looks like the input data is not split correctly. It always generates only one map task and assigns it to a single node. I tried passing parameters like -D mapred.max.split.size, but they don't seem to have any effect.
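For reference, this is roughly how I'm invoking the job (the jar and class names are placeholders; mapred.max.split.size is the standard 0.20 property, and mongo.input.split_size is the key I believe the mongo-hadoop connector reads for its own split sizing, so treat that one as an assumption):

```shell
# Hypothetical jar/class names. mapred.max.split.size is the standard
# Hadoop 0.20 property (bytes); mongo.input.split_size is, as far as I
# can tell, the mongo-hadoop key for split size in MB -- an assumption.
hadoop jar my-job.jar com.example.MyJob \
  -D mapred.max.split.size=67108864 \
  -D mongo.input.split_size=8
```

Note that, as far as I understand, -D generic options are only picked up if the driver goes through ToolRunner/GenericOptionsParser; if the job builds its Configuration directly, these flags would be silently ignored, which might explain why they have no effect.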
So the question is: how do I specify the maximum number of input records each mapper can receive?

On Tue, Oct 25, 2011 at 10:56 AM, Artem Yankov <[email protected]> wrote:

> Hey,
>
> I set up a Hadoop cluster on EC2 using this documentation:
> http://wiki.apache.org/hadoop/AmazonEC2
>
> OS: Linux Fedora 8
> Hadoop version: 0.20.203.0
> Java version: "1.7.0_01"
> Heap size: 1 GB (stats always show that it uses only 4% of this)
> I use the mongo-hadoop plugin to get data from MongoDB.
>
> Everything seems to work perfectly with small chunks of data:
> calculations are fast, I get the results, and tasks seem to be
> distributed normally among the slaves.
>
> Then I try to load a huge amount of data (22 million records) and
> everything hangs. The first slave receives a map task and the other
> slaves do not. In the logs I constantly see this:
>
> INFO org.apache.hadoop.hdfs.StateChange: *BLOCK* NameSystem.processReport: from x.x.x.x:50010, blocks: 2, processing time: 0 m
>
> I tried using different numbers of slaves (at most I ran 25 nodes), but
> it doesn't help, because it seems that when the first slave receives a
> job it blocks everything else. (Again, everything works fine with small
> chunks of data.)
>
> There is no significant CPU or memory load on the master.
>
> Any ideas on what could be the reason for this?
>
> Artem.
