[ 
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137937#comment-15137937
 ] 

Chris Nauroth commented on HADOOP-12666:
----------------------------------------

[~vishwajeet.dusane] et al, thank you for the contribution.  I'd like to start 
by discussing a few things that we would commonly ask of any contribution of a 
new Hadoop-compatible file system:
# A new feature contribution requires end user documentation.  For inspiration, 
you can take a look at the existing documentation for 
[S3|http://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html],
 [Azure|http://hadoop.apache.org/docs/r2.7.2/hadoop-azure/index.html] and 
[Swift|http://hadoop.apache.org/docs/r2.7.2/hadoop-openstack/index.html].  The 
source for these pages is in the Hadoop code as Markdown, so you can look for 
*.md to find examples.  It's good to discuss how the semantics of this file 
system will be the same as the semantics of HDFS and what will differ/work 
differently/not work at all.  One particular point worth documenting is 
authentication, since there is a reliance on the WebHDFS OAuth feature here.
# Are there any tests that actually integrate with the back-end service, or is 
everything using a mock server right now?  We've seen in the past that real 
integration testing is important.  The mock-based testing is valuable too, 
because Apache Jenkins won't have the capability to run against the live 
service.  A useful setup has been to have a common set of tests that support 
running in both "mock mode" and "live mode".  For an example of this, see the 
hadoop-azure test suites.
# We have a set of [file system contract 
tests|http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/testing.html]
 that exercise the expected semantics of file systems.  We ask that new file 
system implementations implement this contract test suite.  This is also a 
great way to see how closely a file system matches the semantics of HDFS.  
hadoop-azure does not yet run these tests, but HADOOP-12535 has a proposed 
patch to start doing so.  A skeleton of the ADL wiring is sketched after this 
list as well.
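
On the mock/live testing point above, here is a very rough, untested sketch of 
the kind of shared switch I mean.  The package, the {{adl.test.mode.live}} and 
{{adl.test.account.uri}} keys, and the mock hook are hypothetical names for 
illustration only; the hadoop-azure suites remain the authoritative example.

{code:java}
package org.apache.hadoop.fs.adl;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.junit.After;
import org.junit.Assume;
import org.junit.Before;

/**
 * Shared base for ADL tests that can run against either an in-process mock
 * server or the live service, selected by configuration.
 */
public abstract class AdlFileSystemTestBase {
  protected FileSystem fs;

  @Before
  public void setUp() throws Exception {
    Configuration conf = new Configuration();
    boolean live = conf.getBoolean("adl.test.mode.live", false);
    if (live) {
      // Live mode: skip silently unless a test account has been configured.
      String testUri = conf.get("adl.test.account.uri");
      Assume.assumeNotNull(testUri);
      fs = FileSystem.get(URI.create(testUri), conf);
    } else {
      // Mock mode: bind to the in-process mock server so Jenkins can run it.
      fs = createMockFileSystem(conf);
    }
  }

  /** Supplied by the mock harness; returns a FileSystem backed by the mock server. */
  protected abstract FileSystem createMockFileSystem(Configuration conf)
      throws Exception;

  @After
  public void tearDown() throws Exception {
    if (fs != null) {
      fs.close();
    }
  }
}
{code}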

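On the contract test point, the wiring usually amounts to a contract class plus 
one small subclass per operation suite.  This is only a skeleton following the 
pattern other stores use; the class names, package, and {{adls.xml}} resource 
are illustrative (and the two classes would live in separate files):

{code:java}
package org.apache.hadoop.fs.adl.live;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractBondedFSContract;
import org.apache.hadoop.fs.contract.AbstractContractOpenTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;

/** Binds the contract tests to an ADL test file system named in configuration. */
class AdlStorageContract extends AbstractBondedFSContract {
  /** Contract options (supports-append, case sensitivity, ...) declared here. */
  private static final String CONTRACT_XML = "adls.xml";

  AdlStorageContract(Configuration conf) {
    super(conf);
    addConfResource(CONTRACT_XML);
  }

  @Override
  public String getScheme() {
    return "adl";
  }
}

/** One of the per-operation suites; create, delete, mkdir, rename, seek, etc.
    follow the same shape. */
public class TestAdlContractOpen extends AbstractContractOpenTest {
  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    return new AdlStorageContract(conf);
  }
}
{code}

The live test URI would then come from a {{fs.contract.test.fs.adl}} entry in 
the test configuration, following the usual {{fs.contract.test.fs.<scheme>}} 
convention.
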
I scanned through the code.  This is not a comprehensive review, but there are 
a few significant things that I've spotted so far:
# Some classes are using {{com.microsoft.azure.datalake}} as the root of the 
package name.  All package names must use the standard {{org.apache.hadoop}} 
prefix instead.
# Can you please add JavaDoc comments at the top of 
{{PrivateAzureDataLakeFileSystem}} and {{ADLFileSystem}} to describe the design 
intent behind splitting the implementation classes like this?  A potential 
source of confusion is that {{PrivateAzureDataLakeFileSystem}} is using the 
{{public}} Java access modifier, so the "private" in the class name is 
unrelated to Java language visibility semantics.
# {{CachedRefreshTokenBasedAccessTokenProvider}}: The constructor might end up 
creating multiple instances.  There is no locking around creation of the shared 
{{instance}}.  (One way to synchronize this is sketched after this list.)
# {{PrivateDebugAzureDataLakeFileSystem}}/{{PrivateDebugAzureDataLake}}: It is 
unusual to control level of logging by swapping out the {{FileSystem}} 
implementation.  Users will expect that logging level is controlled the 
standard way through Log4J configuration.  Can we eliminate these classes and 
move appropriate logging at debug and trace level directly into 
{{PrivateAzureDataLakeFileSystem}}?
# {{ADLLogger}}: With consideration of the above, I'm not sure that this class 
adds any value over directly calling an SLF4J {{Logger}} instance from the 
relevant call sites.  What do you think of dropping this class and converting 
call sites to call SLF4J directly?  (See the logging sketch after this list.)
# {{ADLConfKeys}}: There is an established precedent in Hadoop configuration 
for using all lower-case letters in configuration keys, delimited by '.' for 
separation of words.  I recommend staying consistent with that pattern.
# You are treading new ground in choosing to cache {{FileStatus}} instances 
locally on the client side to achieve performance.  This seems kind of like the 
Linux concept of the dentry cache.  The tricky thing is cache coherency across 
multiple processes running in the cluster and consistency semantics.  I don't 
have a specific example of something that could break, but there is a risk that 
something in the ecosystem could break because of this.  
No other Hadoop-compatible file system works like this.  This also relates back 
to my earlier point about documentation of expected semantics.
# {{ADLConfKeys#LOG_VERSION}}: I'm not sure what this is meant to be.  Is the 
goal to pass the current build version number to the back end?  If so, then 
consider looking at {{org.apache.hadoop.util.VersionInfo}}, which is already 
tied into the Apache release process.  This version string doesn't need to be 
updated manually.  (There is a small example after this list.)
# {{TestAAAAAUploadBenchmark}}/{{TestAAAAADownloadBenchmark}}: From the 
comments, it sounds like the "AAAA" in these test suite names is intended to 
force a specific execution order, because some instability was seen while 
running the tests in a different order.  This isn't guaranteed to work though, 
because Hadoop tests are gradually being converted to run in multiple 
concurrent processes for performance reasons.  In that model, you could have 
multiple processes each running one of these "TestAAAA" suites, possibly also 
running other suites, and ordering is ultimately non-deterministic.  If there 
is a test isolation bug, then please find root cause, fix the bug, and rename 
these test suites without the "AAAA".
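
Back on the {{CachedRefreshTokenBasedAccessTokenProvider}} point, here is one 
illustrative way to guard creation of the shared {{instance}} with 
double-checked locking.  Constructor arguments are elided because I haven't 
traced what the real constructor needs:

{code:java}
// Illustrative only: thread-safe lazy creation of the shared provider.
private static final Object INSTANCE_LOCK = new Object();
private static volatile CachedRefreshTokenBasedAccessTokenProvider instance;

static CachedRefreshTokenBasedAccessTokenProvider getInstance() {
  CachedRefreshTokenBasedAccessTokenProvider local = instance;
  if (local == null) {
    synchronized (INSTANCE_LOCK) {
      local = instance;
      if (local == null) {
        // Pass whatever the real constructor needs here.
        local = new CachedRefreshTokenBasedAccessTokenProvider();
        instance = local;
      }
    }
  }
  return local;
}
{code}

The important parts are the {{volatile}} field and repeating the null check 
inside the {{synchronized}} block; a plain synchronized factory method works 
too if creation is not on a hot path.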

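On the {{PrivateDebugAzureDataLakeFileSystem}}/{{ADLLogger}} points, the 
conventional pattern is a per-class SLF4J logger whose level is controlled 
purely through Log4J configuration, roughly:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PrivateAzureDataLakeFileSystem /* extends FileSystem ... */ {
  private static final Logger LOG =
      LoggerFactory.getLogger(PrivateAzureDataLakeFileSystem.class);

  private void traceOpen(String path) {
    // {} placeholders are cheap; guard only when building the message is costly.
    LOG.debug("open({}) via WebHDFS OAuth endpoint", path);
    if (LOG.isTraceEnabled()) {
      LOG.trace("open({}) on thread {}", path, Thread.currentThread().getName());
    }
  }
}
{code}

Users would then enable it with something like 
{{log4j.logger.org.apache.hadoop.fs.adl=DEBUG}} in log4j.properties (package 
name illustrative), with no separate debug {{FileSystem}} implementation needed.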

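And for {{ADLConfKeys#LOG_VERSION}}: if the intent is to report the client 
build version to the service, something along these lines avoids a manually 
maintained constant (the header name is just an example):

{code:java}
import org.apache.hadoop.util.VersionInfo;

// VersionInfo is populated at build time by the Apache release process.
String userAgent = "hadoop-adl/" + VersionInfo.getVersion();
// e.g. conn.setRequestProperty("User-Agent", userAgent);
{code}
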
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, 
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft 
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing 
> Hadoop applications such as MR, Hive, HBase, etc., to use ADL Store as
> input or output.
>  
> ADL is an ultra-high-capacity store, optimized for massive throughput, with 
> rich management and security features. More details are available at 
> https://azure.microsoft.com/en-us/services/data-lake-store/



