[
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189587#comment-15189587
]
Chris Nauroth commented on HADOOP-12666:
----------------------------------------
The create/append/flush sequence behaves very differently from stock WebHDFS.
At the protocol layer, the added flush parameter is a deviation from stock
WebHDFS. Basically, every custom *Param class represents a deviation from the
WebHDFS protocol: leaseId, ADLFeatureSet, etc.
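
To make the protocol-layer point concrete, here is a rough sketch of how an
append request could differ on the wire. The /webhdfs/v1 prefix and op=APPEND
come from the stock WebHDFS REST API. The query-parameter names "leaseId" and
"flush" are assumptions taken from the *Param class names mentioned above, not
from the patch itself. The class and method names are hypothetical, and values
are not URL-encoded to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: compares a stock WebHDFS append URI with an ADL-style one. */
class RequestSketch {

    /** Joins ordered query parameters onto the WebHDFS REST prefix. */
    static String buildUri(String path, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return "/webhdfs/v1" + path + "?" + query;
    }

    /** Append request as stock WebHDFS defines it. */
    static String stockAppend(String path) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("op", "APPEND");
        return buildUri(path, params);
    }

    /** Append request carrying extra parameters (wire names are assumed). */
    static String adlStyleAppend(String path, String leaseId, boolean flush) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("op", "APPEND");
        params.put("leaseId", leaseId);               // assumed parameter name
        params.put("flush", Boolean.toString(flush)); // assumed parameter name
        return buildUri(path, params);
    }
}
```

A stock WebHDFS server has no defined meaning for the extra parameters, which
is the sense in which each one is a protocol deviation.
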

At the client layer, the aggressive client-side caching and buffering in the
name of performance creates behavior that differs from stock WebHDFS. Others
and I have called out that even if nothing appears to be broken right now,
that is no guarantee that cache consistency won't become a problem for certain
applications. This is not a wire protocol difference, but it is a significant
deviation in behavior from stock WebHDFS.

At this point, it appears that the ADL protocol, while heavily inspired by the
WebHDFS protocol, is not really a compatible match. It is its own protocol
with its own unique requirements for clients to use it correctly and use it
well. Accidentally connecting the ADL client to an HDFS cluster would be
disastrous. The create/append/flush sequence would put a massive,
unsustainable load on the NameNode in terms of RPC calls and edit logging.
Client write latency would be unacceptable. Likewise, accidentally connecting
the stock WebHDFS client to ADL seems to yield unacceptable performance for
ADL.
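
As a back-of-the-envelope illustration of the load concern, the sketch below
counts append requests under the assumption that the client issues one append
each time a fixed-size buffer fills. The 4 MiB buffer and 1 GiB file are
made-up numbers, not values from the patch, and the class and method names are
hypothetical. It counts client requests only and says nothing about how many
NameNode RPCs or edit log entries each request would cost.

```java
/** Illustrative only: estimates append requests for a buffered writer. */
class AppendLoadSketch {

    /** Number of append requests needed when flushing every flushBytes bytes. */
    static long appendRequests(long fileBytes, long flushBytes) {
        if (fileBytes < 0 || flushBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        // Ceiling division: a partial final buffer still needs its own request.
        return (fileBytes + flushBytes - 1) / flushBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 1L << 30;  // 1 GiB, assumed for illustration
        long flushBytes = 4L << 20; // 4 MiB, assumed for illustration
        System.out.println(appendRequests(fileBytes, flushBytes)); // prints 256
    }
}
```

Multiplied across many concurrent writers, a per-file request count like this
is the kind of load the NameNode would have to absorb.
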

It is these large deviations that lead me to conclude the best choice is a
dedicated client distinct from the WebHDFS client code. Having full control of
that client gives us the opportunity to provide the best possible user
experience with ADL. As I've stated before, though, I can accept a short-term
plan of some code reuse with the WebHDFS client.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
> Key: HADOOP-12666
> URL: https://issues.apache.org/jira/browse/HADOOP-12666
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
> Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch,
> HADOOP-12666-007.patch, HADOOP-12666-008.patch, HADOOP-12666-1.patch
>
> Original Estimate: 336h
> Time Spent: 336h
> Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc., to use ADL store as
> input or output.
>
> ADL is ultra-high capacity, optimized for massive throughput, with rich
> management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/