[
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485289#comment-14485289
]
Eric Payne commented on HADOOP-11802:
-------------------------------------
Thanks [~cmccabe] for your comment and interest in this issue.
This problem is happening in multiple different live clusters. Only a small
percentage of datanodes are affected each day, but once they hit this and the
threads pile up, the datanodes must be restarted.
The only 'terminating on' message in the DN log is coming from
{{DomainSocketWatcher}}'s unhandled exception handler. That is, it's the one
documented in the issue description:
{quote}
{noformat}
2015-04-04 13:12:31,059 [Thread-12] ERROR unix.DomainSocketWatcher:
Thread[Thread-12,5,main] terminating on unexpected exception
java.lang.IllegalStateException: failed to remove
17e33191fa8238098d7d22142f5787e2
2015-04-02 11:48:09,941 [DataXceiver for client
unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO
DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src:
127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID:
e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher:
Thread[Thread-14,5,main] terminating on unexpected exception
java.lang.IllegalStateException: failed to remove
b845649551b6b1eab5c17f630e42489d
...
{noformat}
{quote}
As you pointed out, that exception is raised only after something has already
gone wrong in the main try block of the watcher thread. Since I'm seeing neither
'terminating on InterruptedException' nor 'terminating on IOException', some
other exception must be occurring. The only reference to
{{DomainSocketWatcher}} in the DN log, though, is in the stack trace already
mentioned.
Just above the IllegalStateException stack trace is the following, which
indicates that a premature EOF occurred. There were several of these, but it's
not clear that they are related to the reason the DomainSocketWatcher exited.
Your input would be greatly appreciated.
{noformat}
2015-04-02 11:48:09,885 [DataXceiver for client
DFSClient_attempt_1427231924849_569467_m_000135_0_346288762_1 at
/xxx.xxx.xxx.xxx:41908 [Receiving block
BP-658831282-xxx.xxx.xxx.xxx-1351509219914:blk_3365919992_1105804585360]] ERROR
datanode.DataNode: gsta70851.tan.ygrid.yahoo.com:1004:DataXceiver error
processing WRITE_BLOCK operation src: /xxx.xxx.xxx.xxx:41908 dst:
/xxx.xxx.xxx.xxx:1004
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
    at java.lang.Thread.run(Thread.java:722)
{noformat}
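A note on why the original exception might leave no trace in the log: in Java, when a finally block throws, that exception replaces whatever was already propagating out of the try block, and the earlier exception is not attached as a suppressed exception. The sketch below illustrates this with a plain thread and an uncaught exception handler. The names {{mainLoop}} and {{cleanup}} are hypothetical stand-ins for illustration, not the actual {{DomainSocketWatcher}} code, and this is only one possible explanation for the missing log line.

```java
import java.util.concurrent.atomic.AtomicReference;

class MaskedExceptionSketch {
    // Hypothetical stand-in for an unexpected failure in the main try block.
    static void mainLoop() {
        throw new RuntimeException("original failure in main try block");
    }

    // Hypothetical stand-in for a callback failing during cleanup.
    static void cleanup() {
        throw new IllegalStateException("failed to remove <shmId>");
    }

    /** Runs the try/finally on its own thread and returns what the uncaught handler saw. */
    static Throwable runAndCapture() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread t = new Thread(() -> {
            try {
                mainLoop();
            } finally {
                cleanup(); // throws, discarding the RuntimeException from mainLoop()
            }
        });
        t.setUncaughtExceptionHandler((thread, e) -> seen.set(e));
        t.start();
        try {
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Throwable e = runAndCapture();
        System.out.println("handler saw: " + e);
        System.out.println("suppressed count: " + e.getSuppressed().length);
    }
}
```

In this sketch the handler receives only the IllegalStateException, so a log written from that handler would show nothing about the first failure.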
> DomainSocketWatcher#watcherThread can encounter IllegalStateException in
> finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11802
> URL: https://issues.apache.org/jira/browse/HADOOP-11802
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Eric Payne
> Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and
> leave some cleanup tasks undone.
> {code}
> } finally {
>   lock.lock();
>   try {
>     kick(); // allow the handler for notificationSockets[0] to read a byte
>     for (Entry entry : entries.values()) {
>       // We do not remove from entries as we iterate, because that can
>       // cause a ConcurrentModificationException.
>       sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>     }
>     entries.clear();
>     fdSet.close();
>   } finally {
>     lock.unlock();
>   }
> }
> {code}
> The exception causes {{watcherThread}} to skip the calls to
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client
> unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO
> DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src:
> 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID:
> e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher:
> Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove
> b845649551b6b1eab5c17f630e42489d
>     at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>     at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>     at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>     at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>     at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>     at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>     at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or
> HADOOP-10404. The cluster installation is running code with all of these
> fixes.
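The failure mode in the description (a throwing callback causing {{entries.clear()}} and {{fdSet.close()}} to be skipped) can be reproduced in isolation. The sketch below is a simplified model, not the real class: {{Callback}}, {{cleanupAsWritten}}, and {{cleanupDefensively}} are hypothetical names, and two boolean flags stand in for the two cleanup calls. The defensive variant is only one possible shape for a fix and is not necessarily what was committed for this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CleanupSketch {
    /** Hypothetical stand-in for a registered handler; not the real Handler interface. */
    interface Callback {
        void close(int fd);
    }

    boolean entriesCleared = false; // models entries.clear()
    boolean fdSetClosed = false;    // models fdSet.close()

    /** Same shape as the loop in the description: the first throw aborts the rest. */
    void cleanupAsWritten(List<Integer> fds, Callback cb) {
        for (int fd : fds) {
            cb.close(fd);
        }
        entriesCleared = true;
        fdSetClosed = true;
    }

    /** Assumed alternative: collect callback failures and always run the cleanup. */
    List<RuntimeException> cleanupDefensively(List<Integer> fds, Callback cb) {
        List<RuntimeException> failures = new ArrayList<>();
        try {
            for (int fd : fds) {
                try {
                    cb.close(fd);
                } catch (RuntimeException e) {
                    failures.add(e); // remember the failure, keep notifying the others
                }
            }
        } finally {
            entriesCleared = true;
            fdSetClosed = true;
        }
        return failures;
    }

    public static void main(String[] args) {
        // A callback that fails for one descriptor, like removeShm's checkState.
        Callback failing = fd -> {
            if (fd == 2) {
                throw new IllegalStateException("failed to remove " + fd);
            }
        };
        CleanupSketch a = new CleanupSketch();
        try {
            a.cleanupAsWritten(Arrays.asList(1, 2, 3), failing);
        } catch (IllegalStateException e) {
            System.out.println("as written: cleared=" + a.entriesCleared + " closed=" + a.fdSetClosed);
        }
        CleanupSketch b = new CleanupSketch();
        List<RuntimeException> failures = b.cleanupDefensively(Arrays.asList(1, 2, 3), failing);
        System.out.println("defensive: cleared=" + b.entriesCleared + " closed=" + b.fdSetClosed
            + " failures=" + failures.size());
    }
}
```

Whether swallowing the callback failure is acceptable depends on what the handler was expected to release; the sketch only shows that the remaining cleanup can be made independent of it.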