Hi,

We just ran large-scale Apache Nutch jobs in our evaluation of 20.205.0, and they all failed. Some of these jobs ran concurrently with the fair scheduler enabled. These were simple jobs consuming little RAM; I double-checked, and there were certainly no RAM issues.
All jobs failed, and most tasks died with a less-than-descriptive message. A few reported I/O errors reading task output, yet the data they read is fine. When we reran the same jobs manually (some concurrently), some finished fine and others again died with I/O errors reading task output! The heap allocation for the reducers is not high, but no OOMs were reported. Aside from the occasional I/O error, which I find strange enough by itself, most tasks wrote nothing to the logs that I can link to this problem.

We do not see this happening on our 20.203.0 cluster, although its resources and settings differ. 205 is a new high-end cluster with similarly conservative settings, just more mappers/reducers per node; resource settings are otherwise almost identical. The 203 cluster has three times as many machines, so also more open file descriptors and threads in total. The per-node settings we have been comparing are sketched below.
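For reference, these are the knobs I mean. The property names are the standard ones from 0.20's mapred-default.xml; the values here are illustrative, not our actual settings:

  <!-- mapred-site.xml: task slots per node (illustrative values) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- threads the TaskTracker uses to serve map output to reducers -->
  <property>
    <name>tasktracker.http.threads</name>
    <value>40</value>
  </property>

On top of that there is the open file descriptor limit (ulimit -n) for the user running the TaskTracker daemons.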
Any thoughts to share? Thanks,