Thanks for the quick response, appreciate it. It looks like this might be
the issue, but I am still trying to understand what is causing so many
threads in my situation. Is a thread created per block or per file? If
it's per file, then there should not be more than 15.
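
In case it helps, this is how I have been counting the DN's transfer
threads (a rough sketch; it assumes the DataNode names its xceiver threads
"DataXceiver", as our Hadoop 1.x build appears to, and that jps/jstack from
the JDK are on the PATH):

  # locate the DataNode JVM and count its active DataXceiver threads
  DN_PID=$(jps | awk '/DataNode/ {print $1}')
  jstack "$DN_PID" | grep -c 'DataXceiver'

If that count keeps climbing as more files are processed, that would point
at per-connection threads rather than per-file threads.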
My second question: I read around 5 .gz files in 5 separate processes. That
count is constant, and the sizes of those 5 files are roughly equivalent. So
why does it fail only halfway through and not right at the beginning? I am
reading around 400 files in total, and it always fails when I reach around
the 180th file.

What's the default value of xceivers? Does 4096 consume too much stack
space?
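
For the stack part of the question, this is the back-of-envelope check I am
using (a sketch; -flag ThreadStackSize is HotSpot-specific, and the default
-Xss varies by JVM and platform):

  # report the DN's configured per-thread stack size (in KB; 0 means the
  # platform default, often 512 KB to 1 MB on 64-bit Linux)
  DN_PID=$(jps | awk '/DataNode/ {print $1}')
  jinfo -flag ThreadStackSize "$DN_PID"

At a 1 MB stack, 4096 threads would reserve around 4 GB of virtual address
space for stacks, though physical memory is only touched as the stacks are
actually used.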

Thanks
On Sun, Apr 29, 2012 at 1:14 PM, Harsh J <[email protected]> wrote:

> It sounds to me like you're running out of DN xceivers. Try the
> solution offered at
> http://hbase.apache.org/book.html#dfs.datanode.max.xcievers
>
> I.e., add:
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>4096</value>
> </property>
>
> To your DNs' config/hdfs-site.xml and restart the DNs.
>
> On Mon, Apr 30, 2012 at 1:35 AM, Mohit Anchlia <[email protected]>
> wrote:
> > I even tried to lower the number of parallel jobs even further, but I
> > still get these errors. Any suggestion on how to troubleshoot this issue
> > would be very helpful. Should I run hadoop fsck? How do people
> > troubleshoot such issues? Does it sound like a bug?
> >
> > 2012-04-27 14:37:42,921 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> > 2012-04-27 14:37:42,931 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.199:50010 java.io.EOFException
> > 2012-04-27 14:37:42,932 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_6343044536824463287_24619
> > 2012-04-27 14:37:42,932 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.199:50010
> > 2012-04-27 14:37:42,935 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.204:50010 java.io.EOFException
> > 2012-04-27 14:37:42,935 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_2837215798109471362_24620
> > 2012-04-27 14:37:42,936 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.204:50010
> > 2012-04-27 14:37:42,937 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> > 2012-04-27 14:37:42,939 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.198:50010 java.io.EOFException
> > 2012-04-27 14:37:42,939 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_2223489090936415027_24620
> > 2012-04-27 14:37:42,940 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
> > 2012-04-27 14:37:42,943 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.197:50010 java.io.EOFException
> > 2012-04-27 14:37:42,943 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_1265169201875643059_24620
> > 2012-04-27 14:37:42,944 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.197:50010
> > 2012-04-27 14:37:42,945 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient - DataStreamer Exception: java.io.IOException: Unable to create new block.
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3446)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)
> > 2012-04-27 14:37:42,945 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient - Error Recovery for block blk_1265169201875643059_24620 bad datanode[0] nodes == null
> > 2012-04-27 14:37:42,945 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient - Could not get block locations. Source file "/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411/job.jar" - Aborting...
> > 2012-04-27 14:37:42,945 [Thread-4] INFO  org.apache.hadoop.mapred.JobClient - Cleaning up the staging area hdfs://dsdb1:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411
> > 2012-04-27 14:37:42,945 [Thread-4] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.EOFException
> > 2012-04-27 14:37:42,996 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.200:50010 java.io.IOException: Bad connect ack with firstBadLink as 125.18.62.198:50010
> > 2012-04-27 14:37:42,996 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-7583284266913502018_24621
> > 2012-04-27 14:37:42,997 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.198:50010 java.io.EOFException
> > 2012-04-27 14:37:42,997 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_4207260385919079785_24622
> > 2012-04-27 14:37:42,998 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
> > 2012-04-27 14:37:43,000 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
> > 2012-04-27 14:37:43,002 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.197:50010 java.io.EOFException
> > 2012-04-27 14:37:43,002 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-2859304645525022496_24624
> > 2012-04-27 14:37:43,003 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.197:50010
> > 2012-04-27 14:37:43,003 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.198:50010 java.io.EOFException
> > 2012-04-27 14:37:43,004 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5091361633954135154_24622
> > 2012-04-27 14:37:43,004 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.199:50010 java.io.EOFException
> > 2012-04-27 14:37:43,004 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-1445223397912067500_24624
> > 2012-04-27 14:37:43,005 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
> > 2012-04-27 14:37:43,005 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.199:50010
> > 2012-04-27 14:37:43,006 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.204:50010 java.io.EOFException
> > 2012-04-27 14:37:43,006 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_4137744363907213546_24624
> > 2012-04-27 14:37:43,007 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.204:50010
> > 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.204:50010 java.io.EOFException
> > 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_4553692535678376597_24624
> > 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.197:50010 java.io.EOFException
> > 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-7407489373889053706_24624
> >
> >
> > On Fri, Apr 27, 2012 at 3:45 PM, Mohit Anchlia <[email protected]> wrote:
> >
> >> After all the jobs fail I can't run anything. Once I restart the cluster
> >> I am able to run other jobs with no problems; hadoop fs and other IO
> >> intensive jobs run just fine.
> >>
> >>
> >> On Fri, Apr 27, 2012 at 3:12 PM, John George <[email protected]> wrote:
> >>
> >>> Can you run a regular 'hadoop fs' (put or ls or get) command?
> >>> If yes, how about a wordcount example?
> >>> '<path>/hadoop jar <path>hadoop-*examples*.jar wordcount input output'
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Mohit Anchlia <[email protected]>
> >>> Reply-To: "[email protected]" <[email protected]>
> >>> Date: Fri, 27 Apr 2012 14:36:49 -0700
> >>> To: "[email protected]" <[email protected]>
> >>> Subject: Re: DFSClient error
> >>>
> >>> >I even tried to reduce the number of jobs, but it didn't help. This is
> >>> >what I see:
> >>> >
> >>> >datanode logs:
> >>> >
> >>> >Initializing secure datanode resources
> >>> >Successfully obtained privileged resources (streaming port = ServerSocket[addr=/0.0.0.0,localport=50010] ) (http listener port = sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075])
> >>> >Starting regular datanode initialization
> >>> >26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return value of 143
> >>> >
> >>> >userlogs:
> >>> >
> >>> >2012-04-26 19:35:22,801 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
> >>> >2012-04-26 19:35:22,801 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
> >>> >2012-04-26 19:35:22,808 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> >>> >2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.197:50010, add to deadNodes and continue
> >>> >java.io.EOFException
> >>> >        at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >>> >        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
> >>> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
> >>> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
> >>> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
> >>> >        at java.io.DataInputStream.read(DataInputStream.java:132)
> >>> >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:97)
> >>> >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> >>> >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> >>> >        at java.io.InputStream.read(InputStream.java:85)
> >>> >        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
> >>> >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
> >>> >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
> >>> >        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
> >>> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
> >>> >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
> >>> >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >>> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >>> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> >>> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> >>> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> >>> >        at java.security.AccessController.doPrivileged(Native Method)
> >>> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >>> >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> >>> >        at org.apache.hadoop.mapred.Child.main(Child.java:264)
> >>> >2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.204:50010, add to deadNodes and continue
> >>> >java.io.EOFException
> >>> >
> >>> >namenode logs:
> >>> >
> >>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job job_201204261140_0244 added successfully for user 'hadoop' to queue 'default'
> >>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201204261140_0244
> >>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger: USER=hadoop  IP=125.18.62.196        OPERATION=SUBMIT_JOB    TARGET=job_201204261140_0244    RESULT=SUCCESS
> >>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201204261140_0244
> >>> >2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad connect ack with firstBadLink as 125.18.62.197:50010
> >>> >2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2499580289951080275_22499
> >>> >2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 125.18.62.197:50010
> >>> >2012-04-26 16:12:53,594 INFO org.apache.hadoop.mapred.JobInProgress: jobToken generated and stored with users keys in /data/hadoop/mapreduce/job_201204261140_0244/jobToken
> >>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201204261140_0244 = 73808305. Number of splits = 1
> >>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201204261140_0244_m_000000 has split on node:/default-rack/dsdb4.corp.intuit.net
> >>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201204261140_0244_m_000000 has split on node:/default-rack/dsdb5.corp.intuit.net
> >>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: job_201204261140_0244 LOCALITY_WAIT_FACTOR=0.4
> >>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201204261140_0244 initialized successfully with 1 map tasks and 0 reduce tasks.
> >>> >
> >>> >On Fri, Apr 27, 2012 at 7:50 AM, Mohit Anchlia <[email protected]> wrote:
> >>> >
> >>> >>
> >>> >> On Thu, Apr 26, 2012 at 10:24 PM, Harsh J <[email protected]> wrote:
> >>> >>
> >>> >>> Is only the same IP printed in all such messages? Can you check the
> >>> >>> DN log in that machine to see if it reports any form of issues?
> >>> >>
> >>> >> All IPs were logged with this message.
> >>> >>
> >>> >>> Also, did your jobs fail or keep going despite these hiccups? I notice
> >>> >>> you're threading your clients though (?), but I can't tell if that may
> >>> >>> cause this without further information.
> >>> >>
> >>> >> It started with this error message and slowly all the jobs died with
> >>> >> "shortRead" errors.
> >>> >> I am not sure about threading. I am using a pig script to read .gz files.
> >>> >>
> >>> >>
> >>> >>> On Fri, Apr 27, 2012 at 5:19 AM, Mohit Anchlia <[email protected]> wrote:
> >>> >>> > I had 20 mappers in parallel reading 20 gz files, each file around
> >>> >>> > 30-40MB of data, over 5 hadoop nodes, and then writing to the
> >>> >>> > analytics database. Almost midway it started to get this error:
> >>> >>> >
> >>> >>> > 2012-04-26 16:13:53,723 [Thread-8] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 17.18.62.192:50010 java.io.IOException: Bad connect ack with firstBadLink as 17.18.62.191:50010
> >>> >>> >
> >>> >>> > I am trying to look at the logs but they don't say much. What could
> >>> >>> > be the reason? We are on a pretty closed, reliable network and all
> >>> >>> > machines are up.
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> --
> >>> >>> Harsh J
> >>> >>>
> >>> >>
> >>> >>
> >>>
> >>>
> >>
>
>
>
> --
> Harsh J
>
