[ 
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486155#comment-14486155
 ] 

Colin Patrick McCabe commented on HADOOP-11802:
-----------------------------------------------

I thought about this a little bit more, and I wonder whether this finally block 
inside requestShortCircuitShm is causing a "double removal":

{code}
  public void requestShortCircuitShm(String clientName) throws IOException {
    NewShmInfo shmInfo = null;
    boolean success = false;
    DomainSocket sock = peer.getDomainSocket();
    try {
...
    } finally {
...
      if ((!success) && (peer == null)) {
        // If we failed to pass the shared memory segment to the client,
        // close the UNIX domain socket now.  This will trigger the
        // DomainSocketWatcher callback, cleaning up the segment.
        IOUtils.cleanup(null, sock);
      }
      IOUtils.cleanup(null, shmInfo);
    }
{code}

Closing the socket will remove that shmID, but so will closing the NewShmInfo 
object... let me look into this.
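
To make the suspected "double removal" concrete, here is a minimal sketch. It is not the real ShortCircuitRegistry code: the Registry class, its register/remove methods, and the segment id are hypothetical stand-ins. It only assumes, based on the stack trace in the description, that removal is guarded by a check that throws IllegalStateException when the id is no longer present. Which of the two paths would run first in the real code is not something this sketch claims to know.

```java
import java.util.HashMap;
import java.util.Map;

public class DoubleRemovalSketch {
    /** Hypothetical registry that insists each segment is removed exactly once. */
    static final class Registry {
        private final Map<String, Object> segments = new HashMap<>();

        void register(String shmId) {
            segments.put(shmId, new Object());
        }

        void remove(String shmId) {
            // Mirrors a checkState-style guard: a missing id is treated as a bug.
            if (segments.remove(shmId) == null) {
                throw new IllegalStateException("failed to remove " + shmId);
            }
        }
    }

    /** Returns the message of the exception thrown by the second removal, or null if none. */
    static String removeTwice(String shmId) {
        Registry registry = new Registry();
        registry.register(shmId);
        registry.remove(shmId); // one cleanup path, e.g. closing the NewShmInfo object
        try {
            registry.remove(shmId); // the other path, e.g. the callback fired for the closed socket
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(removeTwice("example-shm-id"));
    }
}
```

If both cleanup paths really do reach the same removal call, whichever runs second would hit the guard, which would match the "failed to remove" message in the log.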

> DomainSocketWatcher#watcherThread can encounter IllegalStateException in 
> finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the 
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and 
> leave some cleanup tasks undone.
> {code}
>       } finally {
>         lock.lock();
>         try {
>           kick(); // allow the handler for notificationSockets[0] to read a byte
>           for (Entry entry : entries.values()) {
>             // We do not remove from entries as we iterate, because that can
>             // cause a ConcurrentModificationException.
>             sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>           }
>           entries.clear();
>           fdSet.close();
>         } finally {
>           lock.unlock();
>         }
>       }
> {code}
> The exception causes {{watcherThread}} to skip the calls to 
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or 
> HADOOP-10404. The cluster installation is running code with all of these 
> fixes.
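
The cleanup problem described above (an exception from one callback skipping the remaining cleanup steps) can be sketched in isolation. This is only an illustration of the general pattern, not the fix adopted for this issue: the runCleanup method, the Runnable callbacks, and the logged strings are all hypothetical, and the real watcher thread deals with entries, an fdSet, and a lock that are not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

public class CleanupSketch {
    /**
     * Runs every callback, then performs the final cleanup steps even if a
     * callback throws. Returns a log of what happened, in order.
     */
    static List<String> runCleanup(List<Runnable> callbacks) {
        List<String> log = new ArrayList<>();
        try {
            for (Runnable callback : callbacks) {
                try {
                    callback.run();
                } catch (RuntimeException e) {
                    // Record the failure and keep going so later callbacks still run.
                    log.add("callback failed: " + e.getMessage());
                }
            }
        } finally {
            // Stand-ins for entries.clear() and fdSet.close() in the quoted code.
            log.add("entries cleared");
            log.add("fdSet closed");
        }
        return log;
    }

    public static void main(String[] args) {
        List<Runnable> callbacks = new ArrayList<>();
        callbacks.add(() -> { throw new IllegalStateException("failed to remove x"); });
        callbacks.add(() -> { });
        System.out.println(runCleanup(callbacks));
    }
}
```

Whether swallowing the exception per callback is appropriate depends on how the failure should be reported; the sketch only shows that the final steps no longer depend on every callback succeeding.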



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
