[
https://issues.apache.org/jira/browse/HADOOP-13278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333823#comment-15333823
]
Adrian Petrescu commented on HADOOP-13278:
------------------------------------------
Hey [~cnauroth] and [[email protected]], thank you for the very detailed
explanations. I can totally understand that Hadoop-compatible filesystems may
carry additional semantics beyond those of the underlying object store, and
that this change would violate them.
I do want to note, though, that the main motivation behind my patch was the
permissions-related issue it was solving, not the minor performance
improvement. That issue still exists, and I'll file a new issue for it as
[[email protected]] recommended, but first I just want to understand whether
it is even possible/desirable for S3A to support buckets which have those kinds
of IAM roles applied.
Consider a bucket with a policy like the one below (which is a pretty common
setup):
{code}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "prodS3EnvironmentData",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::mybucket/a/b/c/*"
      ]
    }
  ]
}
{code}
Given what I now understand about the S3A contract, it seems you simply cannot
layer an S3A filesystem on this bucket, even if it were "rooted" at directory
{{a/b/c}}: creating any directory, say {{a/b/c/d}}, could never validate
whether a file named {{a/b}} exists, because that check would throw a 403.
Is this correct? If so, I'm not sure a separate issue is needed; the use case
would simply be unsupported and I'll have to move my S3A filesystem to a bucket
that grants Hadoop/Spark root access.
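To make the failure mode concrete, here is a minimal sketch (not the actual S3A code; the prefix constant and helper names are purely illustrative) that simulates the parent-path validation against the prefix-scoped policy above:

```python
# Illustrative sketch: simulates why mkdirs-style parent validation fails
# under an IAM policy that grants access only below the prefix "a/b/c/".

ALLOWED_PREFIX = "a/b/c/"  # mirrors arn:aws:s3:::mybucket/a/b/c/*

def head_object(key):
    """Stand-in for an S3 HEAD request under the prefix-scoped policy."""
    if not (key + "/").startswith(ALLOWED_PREFIX):
        raise PermissionError(f"403 Forbidden: {key}")
    # 200/404 distinction omitted; access itself is what matters here

def parent_paths(path):
    """Yield every ancestor prefix of a slash-separated path."""
    parts = path.strip("/").split("/")
    for i in range(1, len(parts)):
        yield "/".join(parts[:i])

def mkdirs_with_validation(path):
    """Mimics the old behavior: probe every parent before creating."""
    for parent in parent_paths(path):
        head_object(parent)  # raises once we climb above a/b/c/
    return True

try:
    mkdirs_with_validation("a/b/c/d")
except PermissionError as e:
    print(e)  # the probe of "a", outside the granted prefix, is denied
```

The traversal checks {{a}}, {{a/b}}, and {{a/b/c}} in turn, and the very first probe already falls outside the granted prefix, so the call fails even though creating {{a/b/c/d}} itself would be allowed.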
> S3AFileSystem mkdirs does not need to validate parent path components
> ---------------------------------------------------------------------
>
> Key: HADOOP-13278
> URL: https://issues.apache.org/jira/browse/HADOOP-13278
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3, tools
> Reporter: Adrian Petrescu
> Priority: Minor
>
> According to S3 semantics, there is no conflict if a bucket contains a key
> named {{a/b}} and also a directory named {{a/b/c}}. "Directories" in S3 are,
> after all, nothing but prefixes.
> However, the {{mkdirs}} call in {{S3AFileSystem}} does go out of its way to
> traverse every parent path component for the directory it's trying to create,
> making sure there's no file with that name. This is suboptimal for three main
> reasons:
> * Wasted API calls, since the client is getting metadata for each path
> component
> * This can cause *major* problems with buckets whose permissions are being
> managed by IAM, where access may not be granted to the root bucket, but only
> to some prefix. When you call {{mkdirs}}, even on a prefix that you have
> access to, the traversal up the path will cause you to eventually hit the
> root bucket, which will fail with a 403 - even though the directory creation
> call would have succeeded.
> * Some people might actually have a file that matches some other file's
> prefix... I can't see why they would want to do that, but it's not against
> S3's rules.
> I've opened a pull request with a simple patch that just removes this portion
> of the check. I have tested it with my team's instance of Spark + Luigi, and
> can confirm it works, and resolves the aforementioned permissions issue for a
> bucket on which we only had prefix access.
> This is my first ticket/pull request against Hadoop, so let me know if I'm
> not following some convention properly :)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)