[
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137937#comment-15137937
]
Chris Nauroth commented on HADOOP-12666:
----------------------------------------
[~vishwajeet.dusane] et al, thank you for the contribution. I'd like to start
by discussing a few things that we would commonly ask of any contribution of a
new Hadoop-compatible file system:
# A new feature contribution requires end user documentation. For inspiration,
you can take a look at the existing documentation for
[S3|http://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html],
[Azure|http://hadoop.apache.org/docs/r2.7.2/hadoop-azure/index.html] and
[Swift|http://hadoop.apache.org/docs/r2.7.2/hadoop-openstack/index.html]. The
source for these pages is in the Hadoop code as Markdown, so you can look for
*.md to find examples. It's good to discuss where the semantics of this file
system will match the semantics of HDFS and where they will differ or not work
at all. One particular point worth documenting is
authentication, since there is a reliance on the WebHDFS OAuth feature here.
# Are there any tests that actually integrate with the back-end service, or is
everything using a mock server right now? We've seen in the past that real
integration testing is important. The mock-based testing is valuable too,
because Apache Jenkins won't have the capability to run against the live
service. A useful setup has been to have a common set of tests that support
running in both "mock mode" and "live mode"; there is a rough sketch of that
pattern after this list. For an example of this, see the hadoop-azure test
suites.
# We have a set of [file system contract
tests|http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/testing.html]
that exercise the expected semantics of file systems. We ask that new file
system implementations implement this contract test suite. This is also a
great way to see how closely a file system matches the semantics of HDFS.
hadoop-azure does not yet run these tests, but HADOOP-12535 has a proposed
patch to start doing so. A skeleton of how a new file system binds into these
suites is also sketched below.
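To illustrate point 2, here is a rough sketch of a test base class that can run
the same suite against either a mock server or the live service. All class
names and configuration keys below are hypothetical, not taken from the patch:
{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.junit.Before;

// Hypothetical base class: the same test methods run against either a local
// mock server or the live Azure Data Lake service, selected by a test-only
// configuration flag.
public abstract class AdlTestBase {
  protected FileSystem fs;

  @Before
  public void setUp() throws Exception {
    Configuration conf = new Configuration();
    if (conf.getBoolean("test.fs.adl.live", false)) {
      // Live mode: bind to the real service named in the test configuration.
      fs = FileSystem.get(new URI(conf.get("test.fs.adl.name")), conf);
    } else {
      // Mock mode: point the file system at a locally started mock server.
      fs = createMockBackedFileSystem(conf);
    }
  }

  // Subclasses (or a shared helper) start the mock server and return a
  // FileSystem instance wired to it.
  protected abstract FileSystem createMockBackedFileSystem(Configuration conf)
      throws Exception;
}
{code}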
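And for point 3, wiring a new file system into the contract tests is mostly
boilerplate: one small subclass per contract suite plus a contract binding
class. A minimal sketch, where {{AdlStorageContract}} is a hypothetical class
that would extend {{AbstractFSContract}}, load the contract XML describing this
file system's semantics, and return the {{FileSystem}} under test:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractContractCreateTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;

// One subclass like this per contract suite (create, delete, mkdir, open,
// rename, seek, ...) runs the shared semantic tests against the new
// file system.
public class TestAdlContractCreate extends AbstractContractCreateTest {
  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    return new AdlStorageContract(conf);   // hypothetical contract binding
  }
}
{code}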
I scanned through the code. This is not a comprehensive review, but there are
a few significant things that I've spotted so far:
# Some classes are using {{com.microsoft.azure.datalake}} as the root of the
package name. All package names must use the standard {{org.apache.hadoop}}
prefix instead.
# Can you please add JavaDoc comments at the top of
{{PrivateAzureDataLakeFileSystem}} and {{ADLFileSystem}} to describe the design
intent behind splitting the implementation classes like this? A potential
source of confusion is that {{PrivateAzureDataLakeFileSystem}} is using the
{{public}} Java access modifier, so the "private" in the class name is
unrelated to Java language visibility semantics.
# {{CachedRefreshTokenBasedAccessTokenProvider}}: The constructor might end up
creating multiple instances, because there is no locking around creation of
the shared {{instance}}; a thread-safe sketch appears after this list.
# {{PrivateDebugAzureDataLakeFileSystem}}/{{PrivateDebugAzureDataLake}}: It is
unusual to control the logging level by swapping out the {{FileSystem}}
implementation. Users will expect the logging level to be controlled the
standard way, through Log4J configuration. Can we eliminate these classes and
move appropriate logging at debug and trace level directly into
{{PrivateAzureDataLakeFileSystem}}?
# {{ADLLogger}}: With the above in mind, I'm not sure that this class adds any
value over directly calling an SLF4J {{Logger}} instance from the relevant
call sites. What do you think of dropping this class and converting call sites
to call SLF4J directly? There is a logging sketch after this list.
# {{ADLConfKeys}}: There is an established precedent in Hadoop configuration
for using all lower-case letters in configuration keys, delimited by '.' for
separation of words. I recommend staying consistent with that pattern; a short
example follows this list.
# You are treading new ground in choosing to cache {{FileStatus}} instances
locally on the client side to improve performance. This seems kind of like the
Linux concept of the dentry cache. The tricky part is cache coherency and
consistency semantics across multiple processes running in the cluster. I don't
have a specific example of something that could break, but this seems like
there could be a risk of something in the ecosystem breaking because of this.
No other Hadoop-compatible file system works like this. This also relates back
to my earlier point about documentation of expected semantics.
# {{ADLConfKeys#LOG_VERSION}}: I'm not sure what this is meant to be. Is the
goal to pass the current build version number to the back end? If so, then
consider looking at {{org.apache.hadoop.util.VersionInfo}}, which is already
tied into the Apache release process. This version string doesn't need to be
updated manually; there is a small sketch after this list.
# {{TestAAAAAUploadBenchmark}}/{{TestAAAAADownloadBenchmark}}: From the
comments, it sounds like the "AAAA" in these test suite names is intended to
force a specific execution order, because some instability was seen while
running the tests in a different order. This isn't guaranteed to work though,
because Hadoop tests are gradually being converted to run in multiple
concurrent processes for performance reasons. In that model, you could have
multiple processes each running one of these "TestAAAA" suites, possibly also
running other suites, and ordering is ultimately non-deterministic. If there
is a test isolation bug, then please find the root cause, fix the bug, and
rename these test suites to drop the "AAAA".
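On point 3 above, one way to guarantee a single shared instance is to serialize
its creation. A minimal sketch, assuming for illustration that the provider has
a no-argument constructor (the holder class name is made up):
{code:java}
// Illustrative only: creation of the shared instance is synchronized, so
// concurrent first callers cannot each construct their own copy.
public final class SharedTokenProviderHolder {
  private static CachedRefreshTokenBasedAccessTokenProvider instance;

  private SharedTokenProviderHolder() {
  }

  public static synchronized CachedRefreshTokenBasedAccessTokenProvider get() {
    if (instance == null) {
      instance = new CachedRefreshTokenBasedAccessTokenProvider();
    }
    return instance;
  }
}
{code}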
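For points 4 and 5, the usual pattern is one SLF4J {{Logger}} per class, with
the level decided entirely by the Log4J configuration rather than by which
class is instantiated. A rough sketch (the method is invented for the example):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PrivateAzureDataLakeFileSystem /* extends FileSystem ... */ {
  // One static logger per class; debug/trace output is enabled or disabled
  // in log4j.properties, not by swapping FileSystem implementations.
  private static final Logger LOG =
      LoggerFactory.getLogger(PrivateAzureDataLakeFileSystem.class);

  void logExample(String path, long length) {   // illustrative method
    // Parameterized messages avoid string construction when the level is off.
    LOG.debug("open {} (length {})", path, length);
  }
}
{code}
Users would then enable it with an ordinary Log4J line such as
{{log4j.logger.<fully.qualified.ClassName>=DEBUG}}, using whatever the final
package name turns out to be.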
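On point 6, a concrete illustration of the convention; the key names below are
invented for the example, not proposals for the final names:
{code:java}
// Follows the established Hadoop pattern: all lower case, words separated
// by '.', with a *_DEFAULT constant alongside each key where applicable.
public final class AdlConfKeys {
  public static final String ADL_EVENTS_TRACKING_CLUSTERNAME =
      "adl.events.tracking.clustername";                 // hypothetical key
  public static final String ADL_FEATURE_CACHE_FILE_STATUS =
      "adl.feature.cache.file.status";                   // hypothetical key
  public static final boolean ADL_FEATURE_CACHE_FILE_STATUS_DEFAULT = false;

  private AdlConfKeys() {
  }
}
{code}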
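And on point 8, if the intent of {{LOG_VERSION}} is to report the client build
version to the back end, {{VersionInfo}} already carries that without manual
updates; a minimal sketch (the class and method names are invented):
{code:java}
import org.apache.hadoop.util.VersionInfo;

public class ClientVersionExample {
  // Build the version string reported to the service from the release
  // metadata instead of a manually maintained constant.
  static String clientVersionString() {
    return "hadoop-" + VersionInfo.getVersion()
        + " (r" + VersionInfo.getRevision() + ")";
  }

  public static void main(String[] args) {
    System.out.println(clientVersionString());
  }
}
{code}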
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
> Key: HADOOP-12666
> URL: https://issues.apache.org/jira/browse/HADOOP-12666
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
> Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-1.patch
>
> Original Estimate: 336h
> Time Spent: 336h
> Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as
> input or output.
>
> ADL is an ultra-high-capacity store, optimized for massive throughput, with
> rich management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/