Re: I/O errors reading task output on 20.205.0

Harsh J Fri, 30 Dec 2011 05:13:38 -0800

Yes your .205 release should have it. It should fix your issue!

On Fri, Dec 30, 2011 at 6:24 PM, Markus Jelsma
<[email protected]> wrote:
> Hi, (didn't reply to list before)
>
>> Does your DN log show up any form of errors when you run into this?
>
> Actually, i looked checked again to be sure and noticed errors that i didn't
> notice before:
>
> 2011-12-29 19:51:01,799 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(141.105.120.152:50010,
> storageID=DS-454617998-141.105.120.152-50010-1324646606851, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers
> 256
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:92)
>        at java.lang.Thread.run(Thread.java:662)
>
> but also this one:
>
> 2011-12-29 19:51:00,675 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(141.105.120.152:50010,
> storageID=DS-454617998-141.105.120.152-50010-1324646606851, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>        at java.io.DataInputStream.readShort(DataInputStream.java:298)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:351)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>        at java.lang.Thread.run(Thread.java:662)
>
>
>> This happens with just two jobs reading how many files? And how many
>> DNs are these spread across?
>
> One file, 15 parts spread across five machines.
>
>>
>> I'm thinking its probably something to do with your ulimits for the
>> running DN processes, but I can't say for sure without taking a look
>> at the logs.
>
> Ulimits for open files is set to 16k for all machines.
>
>>
>> Some other stuff I can think of, a little blindly:
>> - What's your dfs.datanode.max.xcievers settings?
>
> I don't know. I increased it for a 0.22.0 test cluster but this is 0.20.205.0
> and i haven't seen that configuration directive in the manual for this
> version. At least not in the hdfs-, core or mapred-default files.
>
>> - Can you ensure 'hadoop classpath' on all nodes reflects the same
>> output, and no accidental jar mixups?
>
> They are identical. All machines were installed and configured automatically
> and looking at it i don't see any differences.
>
> Is there such a max.xceivers setting in the 0.20.x branch? Judging from the
> exception it might be that's the problem.
>
> Thanks!
>
>
>
>
>
>> Does your DN log show up any form of errors when you run into this?
>> This happens with just two jobs reading how many files? And how many
>> DNs are these spread across?
>>
>> I'm thinking its probably something to do with your ulimits for the
>> running DN processes, but I can't say for sure without taking a look
>> at the logs.
>>
>> Some other stuff I can think of, a little blindly:
>> - What's your dfs.datanode.max.xcievers settings?
>> - Can you ensure 'hadoop classpath' on all nodes reflects the same
>> output, and no accidental jar mixups?
>>
>> On Thu, Dec 29, 2011 at 11:48 PM, Markus Jelsma
>>
>> <[email protected]> wrote:
>> > We just reproduced it (somehow) by running jobs concurrently reading the
>> > same data. Two out of three similar jobs died early in the map phase
>> > with Could not obtain block errors, one finished completely.
>> >
>> > java.io.IOException: Could not obtain block: blk_119146860335302651_13067
>> > file=/user/systems/crawl/crawldb/current/part-00000/data
>> >        at
>> > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.
>> > java:2093) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.jav
>> > a:1897) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2048)
>> >        at java.io.DataInputStream.readFully(DataInputStream.java:178)
>> >        at
>> > org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:
>> > 63) at
>> > org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>> > at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>> > at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2069)
>> > at
>> > org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyVal
>> > ue(SequenceFileRecordReader.java:68) at
>> > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(Map
>> > Task.java:532) at
>> > org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at
>> > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at
>> > org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at
>> > org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >        at java.security.AccessController.doPrivileged(Native Method)
>> >        at javax.security.auth.Subject.doAs(Subject.java:396)
>> >        at
>> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
>> > .java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >
>> > Another job (different but reading the same data) finished the map phase
>> > but died partially (half of the reducers) and completely frooze.
>> >
>> > 2011-12-29 18:07:58,899 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> > in createBlockOutputStream java.io.EOFException
>> > 2011-12-29 18:07:58,899 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> > block blk_4748641522370871094_13532
>> > 2011-12-29 18:07:58,900 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>> > datanode 141.105.120.154:50010
>> > 2011-12-29 18:07:58,902 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> > in createBlockOutputStream java.io.EOFException
>> > 2011-12-29 18:07:58,902 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> > block blk_-1454920600140944030_13532
>> > 2011-12-29 18:07:58,903 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>> > datanode 141.105.120.152:50010
>> > 2011-12-29 18:07:58,907 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> > in createBlockOutputStream java.io.IOException: Bad connect ack with
>> > firstBadLink as 141.105.120.153:50010
>> > 2011-12-29 18:07:58,907 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> > block blk_3551418605384221738_13532
>> > 2011-12-29 18:07:58,908 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>> > datanode 141.105.120.153:50010
>> > 2011-12-29 18:07:58,910 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> > in createBlockOutputStream java.io.EOFException
>> > 2011-12-29 18:07:58,910 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> > block blk_-1826030182013954555_13532
>> > 2011-12-29 18:07:58,911 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>> > datanode 141.105.120.150:50010
>> > 2011-12-29 18:07:58,911 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DataStreamer Exception: java.io.IOException: Unable to create new block.
>> >        at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DF
>> > SClient.java:3213) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.ja
>> > va:2406) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClie
>> > nt.java:2646)
>> >
>> > 2011-12-29 18:07:58,912 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-1826030182013954555_13532 bad datanode[0] nodes
>> > == null 2011-12-29 18:07:58,912 WARN org.apache.hadoop.hdfs.DFSClient:
>> > Could not get block locations. Source file "/user/systems/generate-
>> > temp-1325180944829/_temporary/_attempt_201112290956_0012_r_000004_0/fetch
>> > list-13/part-00004" - Aborting...
>> > 2011-12-29 18:07:59,049 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
>> > Initializing logs' truncater with mapRetainSize=-1 and
>> > reduceRetainSize=-1 2011-12-29 18:07:59,062 WARN
>> > org.apache.hadoop.mapred.Child: Error running child
>> > java.io.EOFException
>> >        at java.io.DataInputStream.readShort(DataInputStream.java:298)
>> >        at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(
>> > DFSClient.java:3272) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DF
>> > SClient.java:3196) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.ja
>> > va:2406) at
>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClie
>> > nt.java:2646) 2011-12-29 18:07:59,064 INFO org.apache.hadoop.mapred.Task:
>> > Runnning cleanup for the task
>> >
>> > It smells like the datanodes in 20.205.0 don't deal well with concurrent
>> > jobs, especially handling the same data.
>> >
>> > Is there any advice for this? Again, this does not happen on 20.203.0.
>> > Many thanks
>> >
>> >> I should add that the failing tasks that ran concurrently all read the
>> >> same map files from HDFS.
>> >>
>> >> > Hi,
>> >> >
>> >> > We just ran run large scale Apache Nutch jobs in our evaluation of
>> >> > 20.205.0 and they all failed. Some of these jobs ran concurrently with
>> >> > the fair scheduler enabled. These were simple jobs consuming little
>> >> > RAM. I double checked and there were certainly no RAM issues.
>> >> >
>> >> > All jobs failed and most tasks had a less than descriptive message. A
>> >> > few told they dealt with I/O errors reading task output. However, the
>> >> > data the read is fine. When we ran the same jobs manually (and some
>> >> > concurrently) some did fine and others died for with I/O errors
>> >> > reading task output again!
>> >> >
>> >> > The heap allocation for the reducers is not high but no OOM's were
>> >> > reported. Besides the occasional I/O error, which i think is strange
>> >> > enough, most tasks did not write anything to the logs that i can link
>> >> > to this problem.
>> >> >
>> >> > We do not see this happening on our 20.203.0 cluster although
>> >> > resources and settings are different. 205 is a new high-end cluster
>> >> > with similar conservative settings but only more mappers/reducers per
>> >> > node. Resource settings are almost identical. The 203 cluster has
>> >> > three times as many machines so also more open file descriptors and
>> >> > threads.
>> >> >
>> >> > Any thoughts to share?
>> >> > Thanks,




-- 
Harsh J

Re: I/O errors reading task output on 20.205.0

Reply via email to