[
https://issues.apache.org/jira/browse/HADOOP-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886734#comment-15886734
]
Joel Baranick edited comment on HADOOP-14124 at 2/27/17 11:27 PM:
------------------------------------------------------------------
Hey Steve,
Thanks for the info. I read the Hadoop Filesystem specification, and this
scenario appears to violate parts of it.
First, the postcondition of the specification for {{FSDataOutputStream
create(Path, ...)}} states that the "... updated (valid) FileSystem must
contain all the parent directories of the path, as created by
mkdirs(parent(p)).". I would contend that in this scenario, the opposite is
happening.
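For concreteness, here is a minimal Java sketch of that postcondition check.
The bucket name and the class name are placeholders of mine, not anything
from S3A itself, and credentials/configuration are assumed to be set up:
{code:java}
import java.io.FileNotFoundException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreatePostconditionCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    Path file = new Path("/job/task/file");
    fs.create(file).close();
    // Per the spec, every ancestor of the new file must now exist as a
    // directory, exactly as if mkdirs(parent(p)) had been called.
    for (Path p = file.getParent(); p != null; p = p.getParent()) {
      try {
        if (!fs.getFileStatus(p).isDirectory()) {
          System.out.println("postcondition broken: " + p + " is not a directory");
        }
      } catch (FileNotFoundException e) {
        System.out.println("postcondition broken: " + p + " does not exist");
      }
    }
  }
}
{code}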
Second, the "Empty (non-root) directory" postcondition of the specification for
{{FileSystem.delete(Path P, boolean recursive)}} states that "Deleting an empty
directory that is not root will remove the path from the FS and return true.".
While this is occurring, I think that considering the a fake directory empty
even if it has another fake directory in it is incorrect. For example, on
debian, the following doesn't work.
{noformat}
[~]# mkdir job
[~]# cd job
[job]# mkdir task
[job]# cd ..
[~]# rmdir job
rmdir: failed to remove ‘job’: Directory not empty
{noformat}
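The analogous sequence through the FileSystem API would look something like
the sketch below (placeholder bucket and class name again). If S3A treats
{{/job}} as empty here, the non-recursive delete succeeds where POSIX
{{rmdir}} refuses:
{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NonRecursiveDelete {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    fs.mkdirs(new Path("/job/task"));  // mirrors: mkdir job; mkdir job/task
    try {
      // mirrors: rmdir job -- POSIX refuses because /job is not empty
      boolean deleted = fs.delete(new Path("/job"), false /* not recursive */);
      System.out.println("delete(/job, false) returned " + deleted);
    } catch (IOException e) {
      System.out.println("delete refused: " + e.getMessage());
    }
  }
}
{code}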
Additionally, the interaction of AmazonS3Client/CyberDuck with empty
directories seems different from what you described. See the following
scenario (a Java sketch of the probing calls follows the list):
# Open CyberDuck and connect to an S3 Bucket
# Create a folder called {{job}} in CyberDuck
# Right Click on the {{job}} folder and open +Info+. Result: _Size = 0B_ and
_S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}. Result:
_Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}. Result:
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}. Result:
#* _s3a://bucket/job_ ^dir;empty^
# Navigate into the {{job}} folder in CyberDuck
# Create a folder called {{task}} in CyberDuck
# Right Click on the {{task}} folder and open +Info+. Result: _Size = 0B_ and
_S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}. Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}. Result:
_Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
#* _job/task/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}. Result:
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}. Result:
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Upload _file_ into _/job/task_ via CyberDuck
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}. Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}. Result:
_Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
#* _job/task/_
#* _job/task/file_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}. Result:
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}. Result:
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/task/"))}}. Result:
#* _s3a://bucket/job/task_ ^dir;empty^
#* _s3a://bucket/job/task/file_ ^file^
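For reference, the probing calls used in the steps above reduce to something
like this sketch ({{bucket}} is a placeholder; {{AmazonS3Client}} is the AWS
SDK for Java 1.x client, and {{S3AFileSystem}} is reached through the
ordinary FileSystem API):
{code:java}
import java.net.URI;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProbeFakeDirectories {
  public static void main(String[] args) throws Exception {
    AmazonS3Client s3 = new AmazonS3Client();
    // Raw blobstore view: the fake directory marker "job/" is a real
    // zero-byte object, so a HEAD request on it succeeds.
    System.out.println("job/ marker size: "
        + s3.getObjectMetadata("bucket", "job/").getContentLength());
    for (S3ObjectSummary s : s3.listObjects("bucket", "job/").getObjectSummaries()) {
      System.out.println(s.getKey());
    }
    // FileSystem view of the same keys.
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    for (FileStatus st : fs.listStatus(new Path("/job/"))) {
      System.out.println(st.getPath() + (st.isDirectory() ? " <dir>" : " <file>"));
    }
  }
}
{code}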
One thing to note above is the inconsistent results for
{{S3AFileSystem.listStatus(...)}}: in some cases a folder is reported as
empty, and in others it is not.
At this point, if you delete {{/job/task/file}} in CyberDuck or the AWS
Console, the {{/job}} and {{/job/task}} folders continue to exist, and all
calls continue to return the same results as before (except that
{{/job/task/file}} is excluded from any list results). If, on the other hand,
you had created {{/job/task/file}} via S3AFileSystem, it would have implicitly
removed the parent folders it considers "empty". Then, when {{/job/task/file}}
is deleted, the parent "empty" directories are gone as well.
My last counterpoint to the current Hadoop behavior with regard to S3A is the
AWS S3 Console. It effectively models a filesystem despite being backed by a
blobstore: I'm able to create nested folders, upload a file, delete the file,
and the nested "empty" folders still exist. As for the consistency guarantees,
EMR solves those, making it even more like a true FileSystem.
Regarding HADOOP-9565, I don't have any need or desire for that. I would
prefer that everything continues to function under the FileSystem paradigm.
The POSIX API is fine for me, and consistency is a non-issue because of EMR,
s3mper, or HADOOP-13345 (which seems to build on the ideas from s3mper).
Thanks!
> S3AFileSystem silently deletes "fake" directories when writing a file.
> ----------------------------------------------------------------------
>
> Key: HADOOP-14124
> URL: https://issues.apache.org/jira/browse/HADOOP-14124
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs, fs/s3
> Affects Versions: 2.6.0
> Reporter: Joel Baranick
> Labels: filesystem, s3
>
> I realize that you guys probably have a good reason for {{S3AFileSystem}} to
> clean up "fake" folders when a file is written to S3. That said, the fact
> that it silently does this feels like a separation-of-concerns issue. It
> also leads to weird behavior issues where calls to
> {{AmazonS3Client.getObjectMetadata}} for folders work before calling
> {{S3AFileSystem.create}} but not after. Also, there seems to be no mention
> in the javadoc that the {{deleteUnnecessaryFakeDirectories}} method is
> automatically invoked. Lastly, it seems like the goal of {{FileSystem}}
> should be to ensure that code built on top of it is portable to different
> implementations. This behavior is an example of a case where this can break
> down.