[ https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494364#comment-14494364 ]
Colin Patrick McCabe commented on HADOOP-11802:
-----------------------------------------------
Thanks for following up, [~eepayne].
{code}
java.net.SocketException: write(2) error: Broken pipe
at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
{code}
This error means that the socket was closed by the remote end. This is not
surprising since there was a really long GC, and the client read operation
timed out.
bq. Then, the main DomainSocketWatcher thread wakes up (after regular timeout
interval has expired), and tries to call sendCallbackAndRemove
Small correction: {{DomainSocketWatcher}} is event-triggered rather than
timeout-triggered. The only timeout we have is there so we can check whether
someone has interrupted the thread (a Java {{InterruptedException}}).
{code}
ERROR unix.DomainSocketWatcher: org.apache.hadoop.net.unix.DomainSocketWatcher$2@76845081 terminating on Throwable
java.lang.IllegalArgumentException: DomainSocketWatcher(103231254): file descriptor 249 was closed while still in the poll(2) loop.
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
{code}
This is the root cause. {{DomainSocket#close}} is not supposed to be called
while the socket is in the poll(2) loop. Another file descriptor could be
opened and get the same number, which would cause bad behavior. I can see now
that the call to {{DomainSocket#close}} in DataXceiver is a mistake.
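
To illustrate the descriptor-reuse hazard, here is a minimal sketch. It does not use any Hadoop code: {{FdTable}} is a hypothetical stand-in for the kernel's per-process descriptor table, and the only behavior it models is the POSIX rule that a newly opened descriptor gets the lowest number not currently in use.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class FdReuseSketch {
    /** Hypothetical model of a descriptor table; not a Hadoop or JDK class. */
    static final class FdTable {
        private final BitSet inUse = new BitSet();
        private final Map<Integer, String> names = new HashMap<>();

        /** Allocates the lowest free descriptor number, as POSIX open() does. */
        int open(String name) {
            int fd = inUse.nextClearBit(0);
            inUse.set(fd);
            names.put(fd, name);
            return fd;
        }

        /** Frees the number so that a later open() may hand it out again. */
        void close(int fd) {
            inUse.clear(fd);
            names.remove(fd);
        }

        /** Returns what the number currently refers to, or null if it is free. */
        String describe(int fd) {
            return names.get(fd);
        }
    }

    public static void main(String[] args) {
        FdTable table = new FdTable();
        int watched = table.open("domain socket"); // the number a watcher would remember
        table.close(watched);                      // closed while still being watched
        int other = table.open("unrelated file");  // the freed number is handed out again
        System.out.println("same number reused: " + (watched == other));
        System.out.println("watched number now refers to: " + table.describe(watched));
    }
}
```

In this model the number the watcher still holds ends up naming an unrelated resource, which is why the close has to be coordinated with the watcher rather than done underneath it.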
> DomainSocketWatcher#watcherThread can encounter IllegalStateException in
> finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11802
> URL: https://issues.apache.org/jira/browse/HADOOP-11802
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Eric Payne
> Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and
> leave some cleanup tasks undone.
> {code}
> } finally {
>   lock.lock();
>   try {
>     kick(); // allow the handler for notificationSockets[0] to read a byte
>     for (Entry entry : entries.values()) {
>       // We do not remove from entries as we iterate, because that can
>       // cause a ConcurrentModificationException.
>       sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>     }
>     entries.clear();
>     fdSet.close();
>   } finally {
>     lock.unlock();
>   }
> }
> {code}
> The exception causes {{watcherThread}} to skip the calls to
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
> at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
> at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
> at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
> at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
> at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
> at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or
> HADOOP-10404. The cluster installation is running code with all of these
> fixes.