[ https://issues.apache.org/jira/browse/HADOOP-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016946#comment-18016946 ]
Yunqing Wei commented on HADOOP-19387:
--------------------------------------
We've encountered this same issue in production. Hope it can be fixed soon!
> LocalDirAllocator::getLocalPathForWrite unable to recover when the thread is
> interrupted
> -----------------------------------------------------------------------------------------
>
> Key: HADOOP-19387
> URL: https://issues.apache.org/jira/browse/HADOOP-19387
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Environment: I have created a unit test,
> TestLocalDirAllocator::testInterrupt, which reproduces the bug:
> [https://github.com/nirro01/hadoop/commit/042ca3587654db15e25843593464592d9b5acd42]
>
> Reporter: Nir Rozenberg
> Priority: Critical
> Labels: pull-request-available
>
> When the thread that calls getLocalPathForWrite for the first time is
> interrupted, a dirty context (one whose dirDF is empty) can be created and
> cached. Once that happens, every other call to getLocalPathForWrite with the
> same context fails with DiskErrorException, and there is no way to recover.
> This is the stack trace when the thread is interrupted:
> {noformat}
> java.io.InterruptedIOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1073)
> at org.apache.hadoop.util.Shell.run(Shell.java:959)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1284)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1379)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1361)
> at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:1116)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:798)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:838)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:810)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:837)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:810)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:837)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:810)
> at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:988)
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:327)
> at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:396)
> at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:166)
> at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:147)
> at org.apache.hadoop.fs.TestLocalDirAllocator.testInterrupt(TestLocalDirAllocator.java:579)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> at org.junit.runners.Suite.runChild(Suite.java:128)
> at org.junit.runners.Suite.runChild(Suite.java:27)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
> at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
> at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
> at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
> at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:232)
> at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:55)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1063)
> ... 53 more
> {noformat}
> Any other call to getLocalPathForWrite with the same context then results in a
> DiskErrorException with the message "Could not find any valid local directory
> for file with requested size 100 as the max capacity in any directory is 0" or
> "No space available in any of the local directories", depending on whether the
> size is known.
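> A minimal sketch of the failure pattern, along the lines of the linked unit
> test (the config key name and directory below are placeholders, and it assumes
> the shell-based chmod path in RawLocalFileSystem rather than native IO):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.LocalDirAllocator;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.util.DiskChecker.DiskErrorException;
>
> public class InterruptRepro {
>   public static void main(String[] args) throws Exception {
>     // Placeholder config key pointing at a not-yet-created local directory.
>     Configuration conf = new Configuration();
>     conf.set("test.local.dirs", "/tmp/ldalloc-repro/dir0");
>     LocalDirAllocator allocator = new LocalDirAllocator("test.local.dirs");
>
>     // Interrupt the calling thread before the first allocation. confChanged()
>     // then fails with InterruptedIOException while creating the directories,
>     // and the cached AllocatorPerContext is left with an empty dirDF.
>     Thread.currentThread().interrupt();
>     try {
>       allocator.getLocalPathForWrite("file1", 100, conf);
>     } catch (java.io.InterruptedIOException expected) {
>       Thread.interrupted(); // clear the interrupt flag
>     }
>
>     // The dirty state is kept per context, so every later call fails,
>     // even though the thread is no longer interrupted.
>     try {
>       Path p = allocator.getLocalPathForWrite("file2", 100, conf);
>       System.out.println("unexpectedly succeeded: " + p);
>     } catch (DiskErrorException e) {
>       System.out.println("still broken: " + e.getMessage());
>     }
>   }
> }
> {code}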
>