[ 
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494364#comment-14494364
 ] 

Colin Patrick McCabe commented on HADOOP-11802:
-----------------------------------------------

Thanks for following up, [~eepayne].

{code}
java.net.SocketException: write(2) error: Broken pipe
        at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
{code}
This error means that the socket was closed by the remote end.  This is not 
surprising since there was a really long GC, and the client read operation 
timed out.

bq. Then, the main DomainSocketWatcher thread wakes up (after regular timeout 
interval has expired), and tries to call sendCallbackAndRemove

Small correction: {{DomainSocketWatcher}} is event-triggered rather than 
timeout-triggered.  The only timeout we have is there so that we can check 
whether the thread has been interrupted (a Java {{InterruptedException}}).

{code}
ERROR unix.DomainSocketWatcher: org.apache.hadoop.net.unix.DomainSocketWatcher$2@76845081 terminating on Throwable
java.lang.IllegalArgumentException: DomainSocketWatcher(103231254): file descriptor 249 was closed while still in the poll(2) loop.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
{code}
This is the root cause.  {{DomainSocket#close}} is not supposed to be called 
while the socket is in the poll(2) loop.  Another file descriptor could be 
opened and get the same number, which would cause bad behavior.  I can see now 
that the call to {{DomainSocket#close}} in DataXceiver is a mistake.
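
To illustrate the ordering constraint, here is a minimal sketch. It is a toy model rather than Hadoop code: {{PollSetSketch}} and all of its methods are hypothetical names, and plain integers stand in for file descriptors. The only point is that a descriptor should leave the poll set before it is closed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model, not Hadoop code: integers stand in for file descriptors.
class PollSetSketch {
    private final Set<Integer> inPollSet = new HashSet<>();
    private final Set<Integer> open = new HashSet<>();

    // Open a descriptor and start watching it.
    synchronized void openAndWatch(int fd) {
        open.add(fd);
        inPollSet.add(fd);
    }

    // Closing is refused while the descriptor is still being polled, because
    // the same number could be handed to an unrelated, newly opened file.
    synchronized void close(int fd) {
        if (inPollSet.contains(fd)) {
            throw new IllegalStateException(
                "fd " + fd + " is still in the poll set");
        }
        open.remove(fd);
    }

    // The ordering this sketch argues for: stop watching first, then close.
    synchronized void unwatchThenClose(int fd) {
        inPollSet.remove(fd);
        close(fd);
    }

    synchronized boolean isOpen(int fd) {
        return open.contains(fd);
    }

    synchronized boolean isWatched(int fd) {
        return inPollSet.contains(fd);
    }
}
```

In this model, a direct {{close(249)}} on a watched descriptor is rejected, while {{unwatchThenClose(249)}} succeeds. How the real code should get the descriptor out of the poll loop is a separate question that this sketch does not answer.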

> DomainSocketWatcher#watcherThread can encounter IllegalStateException in 
> finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the 
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and 
> leave some cleanup tasks undone.
> {code}
>       } finally {
>         lock.lock();
>         try {
>           kick(); // allow the handler for notificationSockets[0] to read a byte
>           for (Entry entry : entries.values()) {
>             // We do not remove from entries as we iterate, because that can
>             // cause a ConcurrentModificationException.
>             sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>           }
>           entries.clear();
>           fdSet.close();
>         } finally {
>           lock.unlock();
>         }
>       }
> {code}
> The exception causes {{watcherThread}} to skip the calls to 
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or 
> HADOOP-10404. The cluster installation is running code with all of these 
> fixes.
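
The cleanup problem in the quoted description (a throwing callback causing the later cleanup calls to be skipped) can be sketched in isolation. This is a hypothetical illustration, not the Hadoop patch: {{CleanupSketch}} and {{runAllThenCleanup}} are made-up names, and {{Runnable}} stands in for the per-entry callbacks and for the final {{entries.clear()}} / {{fdSet.close()}} step. One possible approach is to guard each callback individually and put the cleanup in its own finally block.

```java
import java.util.List;

// Hypothetical sketch, not the Hadoop patch: shows one way to keep cleanup
// running even when an individual callback throws.
class CleanupSketch {
    // Runs every handler in order and returns how many of them threw a
    // RuntimeException. The cleanup action runs regardless of failures.
    static int runAllThenCleanup(List<Runnable> handlers, Runnable cleanup) {
        int failures = 0;
        try {
            for (Runnable handler : handlers) {
                try {
                    handler.run();
                } catch (RuntimeException e) {
                    // Count the failure and continue with the next handler
                    // instead of abandoning the rest of the loop.
                    failures++;
                }
            }
        } finally {
            // Always executed, so the cleanup step is never skipped.
            cleanup.run();
        }
        return failures;
    }
}
```

Whether swallowing the exception is acceptable depends on the invariant the callback was checking. This sketch only shows that the cleanup step need not depend on every callback succeeding.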



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
