I should add that the failing tasks that ran concurrently all read the same map files from HDFS.
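If anyone wants to rule out the inputs themselves, hadoop fsck /user/nutch/segments -files -blocks -locations is the obvious block-level check (that path is only an example, not our actual layout); the point being the shared map files look readable outside of MapReduce.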
> Hi,
>
> We just ran large-scale Apache Nutch jobs in our evaluation of 20.205.0,
> and they all failed. Some of these jobs ran concurrently with the fair
> scheduler enabled. These were simple jobs consuming little RAM; I
> double-checked, and there were certainly no RAM issues.
>
> All jobs failed, and most tasks left a less-than-descriptive message. A
> few reported I/O errors reading task output; however, the data they read
> is fine. When we ran the same jobs manually (some concurrently), some
> finished fine and others died with I/O errors reading task output again!
>
> The heap allocation for the reducers is not high, but no OOMs were
> reported. Besides the occasional I/O error, which I think is strange
> enough, most tasks did not write anything to the logs that I can link to
> this problem.
>
> We do not see this happening on our 20.203.0 cluster, although resources
> and settings are different. 205 is a new high-end cluster with similarly
> conservative settings, only with more mappers/reducers per node. Resource
> settings are almost identical. The 203 cluster has three times as many
> machines, so it also has more open file descriptors and threads overall.
>
> Any thoughts to share?
> Thanks,
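Since the failures only show up when several jobs run at once, the TaskTracker's embedded HTTP server seems worth ruling out; as far as I understand, that is the path behind "I/O error reading task output". Below is a minimal mapred-site.xml sketch of the knobs I would inspect first, using the stock 0.20-series property names; the values are illustrative, not our actual settings:

<!-- mapred-site.xml: minimal sketch, illustrative values only -->

<!-- Fair scheduler, as enabled during the evaluation -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- Threads on the TaskTracker's embedded HTTP server that serve
     map output to reducers; the 0.20-series default is 40, which
     concurrent jobs with many reducers can exhaust -->
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>
</property>

<!-- Parallel shuffle fetches per reducer (default is 5); raising it
     puts more pressure on the HTTP threads above -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>

With more mapper/reducer slots per node on the 205 cluster, and a third as many machines as the 203 cluster, per-node pressure on that HTTP server (and on file descriptors) is higher, which would fit the pattern of concurrent runs failing while some manual runs succeed.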
