[ 
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485289#comment-14485289
 ] 

Eric Payne commented on HADOOP-11802:
-------------------------------------

Thanks [~cmccabe] for your comment and interest in this issue.

This problem is happening in multiple live clusters. Only a small 
percentage of datanodes are affected each day, but once a datanode hits this 
and the threads pile up, it must be restarted.

The only 'terminating on' message in the DN log comes from 
DomainSocketWatcher's unhandled exception handler. That is, it is the one 
documented in the issue description:
{quote}
{noformat}
2015-04-04 13:12:31,059 [Thread-12] ERROR unix.DomainSocketWatcher: 
Thread[Thread-12,5,main] terminating on unexpected exception
java.lang.IllegalStateException: failed to remove 
17e33191fa8238098d7d22142f5787e2
2015-04-02 11:48:09,941 [DataXceiver for client 
unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO 
DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 
127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: 
e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: 
Thread[Thread-14,5,main] terminating on unexpected exception
java.lang.IllegalStateException: failed to remove 
b845649551b6b1eab5c17f630e42489d
...
{noformat}
{quote}
As you pointed out, that happens after something has gone wrong in the main 
try block of the watcher thread. Since I'm seeing neither 'terminating on 
InterruptedException' nor 'terminating on IOException', some other exception 
must be occurring. However, the only reference to {{DomainSocketWatcher}} in 
the DN log is in the stack trace already mentioned.
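
One general Java behaviour may be relevant to why no other exception shows up 
in the log: when a finally block throws, that exception replaces any exception 
already propagating out of the try block, and the earlier one is not attached 
as a suppressed exception. The sketch below illustrates only that language 
rule. It is not Hadoop code: the class name, the method names, and the choice 
of UnsupportedOperationException as the "original" failure are assumptions 
made for illustration, and whether this is what happens in the watcher thread 
is not established here.

```java
class FinallyMaskingSketch {
    // Hypothetical stand-in for a cleanup callback that fails.
    static void failingCleanup() {
        throw new IllegalStateException("failed to remove <shm id>");
    }

    // Hypothetical stand-in for a loop body whose try block fails first.
    static void watcherLoop() {
        try {
            throw new UnsupportedOperationException("original failure in try block");
        } finally {
            // The exception thrown here replaces the one thrown in the try block.
            failingCleanup();
        }
    }

    // Returns the simple class name of the exception a caller actually observes.
    static String observedException() {
        try {
            watcherLoop();
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(observedException()); // prints IllegalStateException
    }
}
```

If this applies, logging the in-flight exception before the finally block runs 
would be one way to recover the original cause.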

Just above the IllegalStateException stack trace is the following error, which 
indicates that a premature EOF occurred. There were several of these, but it's 
not clear whether they are related to the reason the DomainSocketWatcher exited.
Your input would be greatly appreciated.
{noformat}
2015-04-02 11:48:09,885 [DataXceiver for client 
DFSClient_attempt_1427231924849_569467_m_000135_0_346288762_1 at 
/xxx.xxx.xxx.xxx:41908 [Receiving block 
BP-658831282-xxx.xxx.xxx.xxx-1351509219914:blk_3365919992_1105804585360]] ERROR 
datanode.DataNode: gsta70851.tan.ygrid.yahoo.com:1004:DataXceiver error 
processing WRITE_BLOCK operation  src: /xxx.xxx.xxx.xxx:41908 dst: 
/xxx.xxx.xxx.xxx:1004
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)
{noformat}

> DomainSocketWatcher#watcherThread can encounter IllegalStateException in 
> finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the 
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and 
> leave some cleanup tasks undone.
> {code}
>       } finally {
>         lock.lock();
>         try {
>           kick(); // allow the handler for notificationSockets[0] to read a byte
>           for (Entry entry : entries.values()) {
>             // We do not remove from entries as we iterate, because that can
>             // cause a ConcurrentModificationException.
>             sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>           }
>           entries.clear();
>           fdSet.close();
>         } finally {
>           lock.unlock();
>         }
>       }
> {code}
> The exception causes {{watcherThread}} to skip the calls to 
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client 
> unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO 
> DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 
> 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: 
> e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: 
> Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove 
> b845649551b6b1eab5c17f630e42489d
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or 
> HADOOP-10404. The cluster installation is running code with all of these 
> fixes.
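
The failure mode in the description can be reproduced in isolation. The sketch 
below is a simplified model, not the Hadoop implementation and not necessarily 
the fix adopted for this issue: the class GuardedCleanupSketch, the 
CloseHandler interface, the failOn42 callback, and the boolean fdSetClosed are 
all hypothetical stand-ins for {{entries}}, {{sendCallback}}, and 
{{fdSet.close()}}. It contrasts the unguarded loop, where one throwing callback 
skips the remaining cleanup, with one possible guarded variant that catches 
per-callback failures and performs the final cleanup in its own finally block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GuardedCleanupSketch {
    /** Hypothetical stand-in for a per-socket close callback. */
    interface CloseHandler { void handle(int fd); }

    final List<Integer> entries = new ArrayList<>(); // stand-in for the entries map
    boolean fdSetClosed = false;                     // stand-in for fdSet.close()
    int failedCallbacks = 0;

    GuardedCleanupSketch(Integer... fds) {
        entries.addAll(Arrays.asList(fds));
    }

    /** Same shape as the quoted finally block: a throwing callback skips the rest. */
    void unguardedCleanup(CloseHandler handler) {
        for (int fd : entries) {
            handler.handle(fd);
        }
        entries.clear();
        fdSetClosed = true;
    }

    /** One possible guarded variant: every callback is attempted, cleanup always runs. */
    void guardedCleanup(CloseHandler handler) {
        try {
            for (int fd : entries) {
                try {
                    handler.handle(fd);
                } catch (RuntimeException e) {
                    failedCallbacks++; // a real implementation would log e here
                }
            }
        } finally {
            entries.clear();
            fdSetClosed = true;
        }
    }

    /** Hypothetical callback that fails for one particular descriptor. */
    static void failOn42(int fd) {
        if (fd == 42) {
            throw new IllegalStateException("failed to remove " + fd);
        }
    }

    public static void main(String[] args) {
        GuardedCleanupSketch a = new GuardedCleanupSketch(7, 42, 9);
        try {
            a.unguardedCleanup(GuardedCleanupSketch::failOn42);
        } catch (IllegalStateException e) {
            // expected: the callback for fd 42 throws and the cleanup is skipped
        }
        System.out.println("unguarded: entries=" + a.entries.size()
            + " fdSetClosed=" + a.fdSetClosed);

        GuardedCleanupSketch b = new GuardedCleanupSketch(7, 42, 9);
        b.guardedCleanup(GuardedCleanupSketch::failOn42);
        System.out.println("guarded: entries=" + b.entries.size()
            + " fdSetClosed=" + b.fdSetClosed
            + " failedCallbacks=" + b.failedCallbacks);
    }
}
```

In the unguarded run the three entries remain and the descriptor set is never 
closed; in the guarded run the entries are cleared, the set is closed, and one 
failed callback is counted. Catching per-callback exceptions trades visibility 
for robustness, so each caught exception should still be logged.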



