[
https://issues.apache.org/jira/browse/HADOOP-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788967#comment-13788967
]
Roland von Herget commented on HADOOP-8640:
-------------------------------------------
This seems to affect 1.2.1 as well, during a normal 'hadoop fs -put'.
from datanode.log:
{quote}
2013-10-08 08:20:27,288 WARN org.apache.hadoop.util.Shell: Could not get disk usage information
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/..../hdfs/datanode/blocksBeingWritten/blk_2086885445451145306': No such file or directory
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.fs.DU.access$200(DU.java:29)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:84)
at java.lang.Thread.run(Thread.java:662)
{quote}
output on the console:
{quote}
13/10/08 08:20:27 INFO hdfs.DFSClient: Exception in createBlockOutputStream 192.168.x.y:50010 java.io.EOFException
13/10/08 08:20:27 INFO hdfs.DFSClient: Abandoning blk_-605554355196703343_69209
13/10/08 08:20:27 INFO hdfs.DFSClient: Excluding datanode 192.168.x.y:50010
{quote}
> DU thread transient failures propagate to callers
> -------------------------------------------------
>
> Key: HADOOP-8640
> URL: https://issues.apache.org/jira/browse/HADOOP-8640
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs, io
> Affects Versions: 2.0.0-alpha, 1.2.1
> Reporter: Todd Lipcon
>
> When running some stress tests, I saw a failure where the DURefreshThread
> failed due to the filesystem changing underneath it:
> {code}
> org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/4/dfs/dn/current/BP-1928785663-172.20.90.20-1343880685858/current/rbw/blk_4637779214690837894': No such file or directory
> {code}
> (the block was probably finalized while the du process was running, which
> caused it to fail)
> The next block write then called {{getUsed()}}, and the exception was
> propagated, causing the write to fail. Since this was a pseudo-distributed
> cluster, the client had no other datanode to write to, so the write failed
> outright.
> The current behavior of propagating the exception to the next (and only the
> next) caller doesn't seem well thought out.
--
This message was sent by Atlassian JIRA
(v6.1#6144)