AngersZhuuuu opened a new pull request #756: [HDFS-14437]Fix BUG mentioned in HDFS-14437
URL: https://github.com/apache/hadoop/pull/756

This addresses the EditLog rolling bug reported in https://issues.apache.org/jira/browse/HDFS-10943. I have described the root cause in a comment on the JIRA issue: https://issues.apache.org/jira/browse/HDFS-14437

In #logSync() there is this wait loop:

```
while (mytxid > synctxid && isSyncRunning) {
  try {
    wait(1000);
  } catch (InterruptedException ie) {
  }
}
```

When #endCurrentLogSegment calls #logSync() while isSyncRunning == true and mytxid > synctxid, the current thread calls #wait and other threads get to run. If no other thread can run, isSyncRunning stays true forever, the current thread never leaves the while loop, and the result is a deadlock.

If other threads do get the lock, they can do a lot of work in 1000 ms. A logSync() call from one of them then finishes the flush, after which synctxid may be bigger than mytxid. In that case the current thread simply returns here:

```
if (mytxid <= synctxid) {
  numTransactionsBatchedInSync++;
  if (metrics != null) {
    // Metrics is non-null only when used inside name node
    metrics.incrTransactionsBatchedInSync();
  }
  return;
}
```

If the JournalSet's output stream is closed at that point, the bug is triggered.

My change adds handling for the close case: whenever wait() times out or the thread is notified by another thread (when that thread finishes logSync()), I set mytxid to the maximum transaction ID, so this bug can no longer happen. In short, the current lock control is not correct.
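
To make the early-return argument concrete, here is a minimal single-threaded sketch of the check described above. It is not the FSEditLog code: the class `LogSyncSketch`, its fields, and the transaction numbers are all hypothetical, and the scenario (the closing thread holds txid 5, other threads write up to 8, a sync covers up to 6) is an assumed interleaving chosen only for illustration.

```java
// Simplified model of the early-return check in logSync().
// Every name and number here is hypothetical, for illustration only.
final class LogSyncSketch {
    /** Newest transaction id written into the in-memory buffer. */
    long lastWrittenTxid;
    /** Newest transaction id known to be flushed. */
    long synctxid;

    /** Existing check: uses the caller's own txid, which may be stale after wait(). */
    boolean returnsEarlyWithStaleTxid(long mytxid) {
        return mytxid <= synctxid;
    }

    /** Check as proposed for the close case: re-read the newest txid after waiting. */
    boolean returnsEarlyWithRefreshedTxid() {
        long mytxid = lastWrittenTxid;
        return mytxid <= synctxid;
    }

    /** Transactions that are written but not yet flushed. */
    long unflushed() {
        return lastWrittenTxid - synctxid;
    }

    /** Assumed state after the closing thread wakes up from wait(). */
    static LogSyncSketch scenario() {
        LogSyncSketch log = new LogSyncSketch();
        log.lastWrittenTxid = 8; // other threads wrote 6..8 during the wait
        log.synctxid = 6;        // another thread's sync covered up to 6
        return log;
    }

    public static void main(String[] args) {
        LogSyncSketch log = scenario();
        long closerTxid = 5; // txid the closing thread captured before waiting
        System.out.println("stale check returns early:     "
                + log.returnsEarlyWithStaleTxid(closerTxid));
        System.out.println("refreshed check returns early: "
                + log.returnsEarlyWithRefreshedTxid());
        System.out.println("unflushed transactions:        " + log.unflushed());
    }
}
```

Under these assumptions the stale check lets the closing thread return while two transactions are still buffered, whereas the refreshed check forces it to continue to the flush.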
