Upon further inspection of that log, it appears the problem is that the setup task just takes a very long time.
Typically it takes at most 6 seconds, but sometimes (the cases where I thought it was hanging) it actually runs and finishes, just after 3-5 minutes. The cleanup task has the same problem (which is where I thought the reduce was getting stuck).

I am currently the only user on this cluster, and I never have more than one job in the queue at a time.

Ideas?

On Fri, Jul 13, 2012 at 1:04 AM, Harsh J <[email protected]> wrote:
> Hey Robert,
>
> Any chance you can pastebin the JT logs, grepped for the bad job ID,
> and send the link across? They shouldn't hang the way you describe.
>
> On Fri, Jul 13, 2012 at 9:33 AM, Robert Dyer <[email protected]> wrote:
> > I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2
> > compute nodes). My input is a sequence file of around 280 MB.
> >
> > Generally, my jobs run just fine and all finish in 2-5 minutes. However,
> > quite randomly the jobs refuse to run. They submit and appear when running
> > 'hadoop job -list' but don't appear on the jobtracker's webpage. If I
> > manually type in the job ID on the webpage, I can see it is trying to run
> > the setup task; the map tasks haven't even started. I've left them to run,
> > and even after several minutes they are still in this state.
> >
> > When I spot this, I kill the job and resubmit it, and generally it works.
> >
> > A couple of times I have seen similar problems with reduce tasks that get
> > stuck while 'initializing'.
> >
> > Any ideas?
