It sounds to me like you're running out of DN xceivers. Try the
solution offered at
http://hbase.apache.org/book.html#dfs.datanode.max.xcievers

I.e., add:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

to each DN's conf/hdfs-site.xml, then restart the DNs.
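If you want to confirm the setting actually took on each DN, a quick parse of hdfs-site.xml works. This is just a sketch of mine, not something from the thread: the sample XML is illustrative, and the fallback of 256 for a missing property is my recollection of the old default, so double-check it against your version.

```python
# Sketch: read dfs.datanode.max.xcievers out of an hdfs-site.xml.
# SAMPLE stands in for a DN's conf/hdfs-site.xml; point at the real file.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
"""

def max_xceivers(xml_text, default=256):
    # A missing property means the DN runs with the built-in default
    # (256 in older releases -- an assumption, verify for your version).
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.datanode.max.xcievers":
            return int(prop.findtext("value"))
    return default

print(max_xceivers(SAMPLE))
```

Run it against each DN's config; anything at or near the default is a likely culprit under 20 parallel gz-reading mappers.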

On Mon, Apr 30, 2012 at 1:35 AM, Mohit Anchlia <[email protected]> wrote:
> I even tried to lower the number of parallel jobs even further, but I still
> get these errors. Any suggestions on how to troubleshoot this issue would be
> very helpful. Should I run hadoop fsck? How do people troubleshoot such
> issues? Does it sound like a bug?
>
> 2012-04-27 14:37:42,921 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission.
> 2012-04-27 14:37:42,931 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.199:50010java.io.EOFException
> 2012-04-27 14:37:42,932 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_6343044536824463287_24619
> 2012-04-27 14:37:42,932 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.199:50010
> 2012-04-27 14:37:42,935 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.204:50010java.io.EOFException
> 2012-04-27 14:37:42,935 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_2837215798109471362_24620
> 2012-04-27 14:37:42,936 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.204:50010
> 2012-04-27 14:37:42,937 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission.
> 2012-04-27 14:37:42,939 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.198:50010java.io.EOFException
> 2012-04-27 14:37:42,939 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_2223489090936415027_24620
> 2012-04-27 14:37:42,940 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.198:50010
> 2012-04-27 14:37:42,943 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.197:50010java.io.EOFException
> 2012-04-27 14:37:42,943 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_1265169201875643059_24620
> 2012-04-27 14:37:42,944 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.197:50010
> 2012-04-27 14:37:42,945 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient -
> DataStreamer Exception: java.io.IOException: Unable to create new block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3446)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)
> 2012-04-27 14:37:42,945 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient -
> Error Recovery for block blk_1265169201875643059_24620 bad datanode[0]
> nodes == null
> 2012-04-27 14:37:42,945 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient -
> Could not get block locations. Source file
> "/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411/job.jar"
> - Aborting...
> 2012-04-27 14:37:42,945 [Thread-4] INFO  org.apache.hadoop.mapred.JobClient
> - Cleaning up the staging area
> hdfs://dsdb1:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411
> 2012-04-27 14:37:42,945 [Thread-4] ERROR
> org.apache.hadoop.security.UserGroupInformation -
> PriviledgedActionException as:hadoop (auth:SIMPLE)
> cause:java.io.EOFException
> 2012-04-27 14:37:42,996 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream
> 125.18.62.200:50010java.io.IOException: Bad connect ack with
> firstBadLink as
> 125.18.62.198:50010
> 2012-04-27 14:37:42,996 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_-7583284266913502018_24621
> 2012-04-27 14:37:42,997 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.198:50010java.io.EOFException
> 2012-04-27 14:37:42,997 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_4207260385919079785_24622
> 2012-04-27 14:37:42,998 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.198:50010
> 2012-04-27 14:37:43,000 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.198:50010
> 2012-04-27 14:37:43,002 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.197:50010java.io.EOFException
> 2012-04-27 14:37:43,002 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_-2859304645525022496_24624
> 2012-04-27 14:37:43,003 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.197:50010
> 2012-04-27 14:37:43,003 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.198:50010java.io.EOFException
> 2012-04-27 14:37:43,004 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_-5091361633954135154_24622
> 2012-04-27 14:37:43,004 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.199:50010java.io.EOFException
> 2012-04-27 14:37:43,004 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_-1445223397912067500_24624
> 2012-04-27 14:37:43,005 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.198:50010
> 2012-04-27 14:37:43,005 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.199:50010
> 2012-04-27 14:37:43,006 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.204:50010java.io.EOFException
> 2012-04-27 14:37:43,006 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_4137744363907213546_24624
> 2012-04-27 14:37:43,007 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Excluding datanode 125.18.62.204:50010
> 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.204:50010java.io.EOFException
> 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_4553692535678376597_24624
> 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Exception in createBlockOutputStream 125.18.62.197:50010java.io.EOFException
> 2012-04-27 14:37:43,008 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient -
> Abandoning block blk_-7407489373889053706_24624
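[A note on triaging output like the above: Harsh asked earlier whether the same IP shows up every time or whether all nodes are affected, and a quick tally over the DFSClient log answers that. A rough sketch; the regex and the embedded sample lines are my own illustration modeled on the log excerpt, not part of the original thread.]

```python
# Sketch: count "Excluding datanode" lines per DataNode IP in a DFSClient log,
# to see whether one node or the whole cluster is being excluded.
import re
from collections import Counter

SAMPLE_LOG = """\
2012-04-27 14:37:42,932 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.199:50010
2012-04-27 14:37:42,936 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.204:50010
2012-04-27 14:37:42,943 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
2012-04-27 14:37:43,005 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
"""

EXCLUDE_RE = re.compile(r"Excluding datanode (\d+\.\d+\.\d+\.\d+):\d+")

def excluded_counts(log_text):
    """Tally excluded-DataNode IPs across the whole log."""
    return Counter(m.group(1) for m in EXCLUDE_RE.finditer(log_text))

for ip, n in excluded_counts(SAMPLE_LOG).most_common():
    print(ip, n)
```

If every DN in the cluster appears in the tally (as in the log above), a cluster-wide resource limit like the xceiver cap is more plausible than a single bad node or NIC.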
>
>
> On Fri, Apr 27, 2012 at 3:45 PM, Mohit Anchlia <[email protected]>wrote:
>
>> After all the jobs fail I can't run anything. Once I restart the cluster I
>> am able to run other jobs with no problems; hadoop fs and other
>> IO-intensive jobs run just fine.
>>
>>
>> On Fri, Apr 27, 2012 at 3:12 PM, John George <[email protected]>wrote:
>>
>>> Can you run a regular 'hadoop fs' (put or ls or get) command?
>>> If yes, how about a wordcount example?
>>> '<path>/hadoop jar <path>hadoop-*examples*.jar wordcount input output'
>>>
>>>
>>> -----Original Message-----
>>> From: Mohit Anchlia <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Fri, 27 Apr 2012 14:36:49 -0700
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: DFSClient error
>>>
>>> >I even tried to reduce the number of jobs, but it didn't help. This is
>>> >what I see:
>>> >
>>> >datanode logs:
>>> >
>>> >Initializing secure datanode resources
>>> >Successfully obtained privileged resources (streaming port =
>>> >ServerSocket[addr=/0.0.0.0,localport=50010] ) (http listener port =
>>> >sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075])
>>> >Starting regular datanode initialization
>>> >26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return value
>>> >of 143
>>> >
>>> >userlogs:
>>> >
>>> >2012-04-26 19:35:22,801 WARN
>>> >org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is
>>> >available
>>> >2012-04-26 19:35:22,801 INFO
>>> >org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library
>>> >loaded
>>> >2012-04-26 19:35:22,808 INFO
>>> >org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
>>> >initialized native-zlib library
>>> >2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
>>> >connect to /125.18.62.197:50010, add to deadNodes and continue
>>> >java.io.EOFException
>>> >        at java.io.DataInputStream.readShort(DataInputStream.java:298)
>>> >        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
>>> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
>>> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
>>> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
>>> >        at java.io.DataInputStream.read(DataInputStream.java:132)
>>> >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:97)
>>> >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
>>> >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>>> >        at java.io.InputStream.read(InputStream.java:85)
>>> >        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
>>> >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
>>> >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
>>> >        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
>>> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
>>> >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
>>> >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>>> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>>> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>> >        at java.security.AccessController.doPrivileged(Native Method)
>>> >        at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>> >        at org.apache.hadoop.mapred.Child.main(Child.java:264)
>>> >2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
>>> >connect to /125.18.62.204:50010, add to deadNodes and continue
>>> >java.io.EOFException
>>> >
>>> >namenode logs:
>>> >
>>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job
>>> >job_201204261140_0244 added successfully for user 'hadoop' to queue
>>> >'default'
>>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker:
>>> >Initializing job_201204261140_0244
>>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger:
>>> >USER=hadoop  IP=125.18.62.196        OPERATION=SUBMIT_JOB
>>> >TARGET=job_201204261140_0244    RESULT=SUCCESS
>>> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >Initializing job_201204261140_0244
>>> >2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>>> >createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad
>>> >connect ack with firstBadLink as 125.18.62.197:50010
>>> >2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>>> >block blk_2499580289951080275_22499
>>> >2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>>> >datanode 125.18.62.197:50010
>>> >2012-04-26 16:12:53,594 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >jobToken generated and stored with users keys in
>>> >/data/hadoop/mapreduce/job_201204261140_0244/jobToken
>>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Input
>>> >size for job job_201204261140_0244 = 73808305. Number of splits = 1
>>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >tip:task_201204261140_0244_m_000000 has split on node:/default-rack/dsdb4.corp.intuit.net
>>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >tip:task_201204261140_0244_m_000000 has split on node:/default-rack/dsdb5.corp.intuit.net
>>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
>>> >job_201204261140_0244 LOCALITY_WAIT_FACTOR=0.4
>>> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Job
>>> >job_201204261140_0244 initialized successfully with 1 map tasks and 0
>>> >reduce tasks.
>>> >
>>> >On Fri, Apr 27, 2012 at 7:50 AM, Mohit Anchlia
>>> ><[email protected]>wrote:
>>> >
>>> >>
>>> >>
>>> >>  On Thu, Apr 26, 2012 at 10:24 PM, Harsh J <[email protected]> wrote:
>>> >>
>>> >>> Is only the same IP printed in all such messages? Can you check the DN
>>> >>> log in that machine to see if it reports any form of issues?
>>> >>>
>>> >>> All IPs were logged with this message
>>> >>
>>> >>
>>> >>> Also, did your jobs fail or kept going despite these hiccups? I notice
>>> >>> you're threading your clients though (?), but I can't tell if that may
>>> >>> cause this without further information.
>>> >>>
>>> >> It started with this error message and slowly all the jobs died with
>>> >> "shortRead" errors.
>>> >> I am not sure about threading. I am using a Pig script to read .gz files.
>>> >>
>>> >>
>>> >>> On Fri, Apr 27, 2012 at 5:19 AM, Mohit Anchlia <
>>> [email protected]>
>>> >>> wrote:
>>> >>> > I had 20 mappers in parallel reading 20 gz files, each file around
>>> >>> > 30-40MB, over 5 hadoop nodes, and then writing to the analytics
>>> >>> > database. Almost midway it started to get this error:
>>> >>> >
>>> >>> >
>>> >>> > 2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient -
>>> >>> > Exception in createBlockOutputStream
>>> >>> > 17.18.62.192:50010java.io.IOException: Bad connect ack with
>>> >>> > firstBadLink as
>>> >>> > 17.18.62.191:50010
>>> >>> >
>>> >>> > I am trying to look at the logs, but they don't say much. What could
>>> >>> > be the reason? We are on a pretty closed, reliable network and all
>>> >>> > machines are up.
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Harsh J
>>> >>>
>>> >>
>>> >>
>>>
>>>
>>



-- 
Harsh J
