I should add that the failing tasks that ran concurrently all read the same 
map files from HDFS.
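
For reference, those reads boil down to each task opening the same MapFile directory on HDFS and doing lookups against it. A minimal sketch of that pattern, assuming the stock org.apache.hadoop.io.MapFile.Reader API and a made-up path with Text keys/values purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SharedMapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: every concurrent task opens this same MapFile
    // directory (data + index) on HDFS, each with its own reader instance;
    // only the underlying HDFS blocks are shared.
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/shared/part-00000", conf);
    try {
      Text key = new Text("http://example.org/");
      Text value = new Text();
      // Random lookup against the shared map file.
      if (reader.get(key, value) != null) {
        System.out.println(key + " -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}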

> Hi,
> 
> We just ran large scale Apache Nutch jobs in our evaluation of 20.205.0
> and they all failed. Some of these jobs ran concurrently with the fair
> scheduler enabled. These were simple jobs consuming little RAM. I
> double-checked and there were certainly no RAM issues.
> 
> All jobs failed, and most tasks left a less than descriptive message. A few
> reported I/O errors reading task output. However, the data they read is
> fine. When we reran the same jobs manually (and some concurrently), some
> finished fine and others died with I/O errors reading task output again!
> 
> The heap allocation for the reducers is not high, but no OOMs were
> reported either. Besides the occasional I/O error, which I think is strange
> enough on its own, most tasks did not write anything to the logs that I can
> link to this problem.
> 
> We do not see this happening on our 20.203.0 cluster, although its resources
> and settings are different. The 205 cluster is new, high-end hardware with
> similarly conservative settings; the main difference is more mappers/reducers
> per node. Resource settings are almost identical. The 203 cluster has three
> times as many machines, so also more open file descriptors and threads overall.
> 
> Any thoughts to share?
> Thanks,
