Hi,

Over the past few weeks we evaluated and partially migrated from Hadoop 0.20.203.0 to 0.22.0. Most things work fine locally, and simple jobs run well on the cluster. However, the most essential part of Nutch, the fetcher, seems to be very unstable on 0.22.0. In every crawl I can now be almost certain that at least some mappers will mysteriously freeze and eventually time out. Other mappers are killed straight away or after a few minutes because of OOM errors. Memory consumption is also a lot higher on 0.22.0.
Right now we have three clusters: an old 0.20.203 cluster, plus the unstable 0.22.0 and a 0.20.205 that both run on the same new cluster. When we run identical jobs on all three clusters, 0.22.0 almost always fails, eating RAM and occasionally freezing a mapper. Stack traces of those mappers show that all threads are blocked, and sometimes jstack is unable to print deadlocks (null).

I tried many settings for 0.22.0 and very conservative settings for Nutch, such as few threads to spare resources (which are actually abundant), but I cannot seem to find the issue. The fetcher job still uses the old mapred API.

I'd like to present a better issue report, but I don't know which component in all this mess is actually responsible. It looks like the tasktracker, but I'm unsure. If anyone can point us in the right direction so we can find the issue and assist in fixing it, that would be great.

Thanks
