[
https://issues.apache.org/jira/browse/HADOOP-18672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704088#comment-17704088
]
Steve Loughran commented on HADOOP-18672:
-----------------------------------------
It's been a long time since I read that paper; I should reread it. In fact, I
should look to see if there is a more recent paper than that 2011 one, as Azure
storage has come a long way.
Anyway,
# abfs is a client connector to classic Azure storage and to the ADLS Gen2
store, which adds a hierarchical namespace on top, giving things like
directories and atomic O(1) operations on them.
# Azure storage contains no HDFS code, and whatever checksums are used in their
implementation, (1) we have no idea if they are compatible with HDFS,
especially as they will be of the encrypted data, not what was uploaded, and
(2) I don't believe they are exposed through the REST APIs.
# ABFS supports etag source on file status/listings (HADOOP-17979) and, as ADLS
Gen2 preserves etags across file rename operations, this might be useful.
So, to update the statement I made before:
* all large scale stores use checksums to validate the continued correctness of
data, usually with some hidden recovery mechanism.
* any store which encrypts blocks will have different checksums for the same
data stored in different accounts/with different keys. Even HDFS encryption
zones do this.
* we don't know what checksum algorithm Microsoft uses
* Azure storage doesn't export checksums compatible with HDFS through its REST
API
* ADLS Gen 2 has etags on files and dirs; file etags are preserved over renames
* Hadoop has a way for you to get at those etags and use them as you see fit.
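To illustrate the point about encrypted blocks, here is a minimal sketch using only the JDK. It assumes a store that checksums the encrypted bytes, and it uses AES/CTR and CRC32 purely as stand-ins; the class name, the fixed demo keys and the choice of algorithms are all hypothetical and say nothing about what Azure storage or HDFS actually do internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.zip.CRC32;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo: the same plaintext, encrypted under two keys,
// produces different stored bytes and therefore different checksums.
final class EncryptedChecksumDemo {

    /** CRC32 of a byte array; stands in for whatever checksum a store computes. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /** Encrypts with AES in CTR mode; key and IV must each be 16 bytes. */
    static byte[] encrypt(byte[] plaintext, byte[] key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE,
                    new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a 16-byte array filled with one value (fixed demo keys, not secure). */
    static byte[] filled(int value) {
        byte[] bytes = new byte[16];
        Arrays.fill(bytes, (byte) value);
        return bytes;
    }

    public static void main(String[] args) {
        byte[] data = "same bytes, two accounts".getBytes(StandardCharsets.UTF_8);
        byte[] iv = filled(0);

        // Two "accounts" holding the same upload under different keys.
        byte[] storedA = encrypt(data, filled(1), iv);
        byte[] storedB = encrypt(data, filled(2), iv);

        System.out.println("plaintext crc : " + Long.toHexString(crc32(data)));
        System.out.println("account A crc : " + Long.toHexString(crc32(storedA)));
        System.out.println("account B crc : " + Long.toHexString(crc32(storedB)));
    }
}
```

A checksum taken over the stored bytes says nothing about whether two copies hold the same uploaded data, so it cannot be compared across accounts or against an HDFS checksum of the plaintext.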
If you want a cloud-friendly distcp, supporting source and destination
stores/filesystems with different checksum or etag algorithms has to be a
requirement. It'd need to remember the source and dest checksums somewhere
(files?) and use that data when working out what has changed, and so deciding
what to copy.
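A rough sketch of that bookkeeping, in plain Java, might look like the following. Nothing here is existing distcp code: `CopyManifest`, `recordCopy` and `needsCopy` are hypothetical names, the tags are treated as opaque strings (an HDFS checksum on one side, an etag on the other), and the manifest lives in memory where a real tool would persist it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical manifest: remembers, per path, the source and destination
// tags seen right after the last successful copy. The two tags are never
// compared with each other, so their algorithms are free to differ.
final class CopyManifest {

    /** Tags observed on each side after a copy completed. */
    static final class Entry {
        final String sourceTag;
        final String destTag;

        Entry(String sourceTag, String destTag) {
            this.sourceTag = sourceTag;
            this.destTag = destTag;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Call after a successful copy, with the tags each store reports. */
    void recordCopy(String path, String sourceTag, String destTag) {
        entries.put(path, new Entry(sourceTag, destTag));
    }

    /**
     * Decides whether a path must be copied again.
     * Pass null for currentDestTag when the destination file is missing.
     */
    boolean needsCopy(String path, String currentSourceTag, String currentDestTag) {
        Entry last = entries.get(path);
        if (last == null || currentDestTag == null) {
            return true; // never copied, or destination has gone away
        }
        boolean sourceChanged = !Objects.equals(last.sourceTag, currentSourceTag);
        boolean destChanged = !Objects.equals(last.destTag, currentDestTag);
        return sourceChanged || destChanged;
    }

    public static void main(String[] args) {
        CopyManifest manifest = new CopyManifest();
        manifest.recordCopy("/data/part-0000", "hdfs-crc:1a2b", "etag-0x8D");
        System.out.println(manifest.needsCopy("/data/part-0000", "hdfs-crc:1a2b", "etag-0x8D"));
        System.out.println(manifest.needsCopy("/data/part-0000", "hdfs-crc:ffff", "etag-0x8D"));
    }
}
```

Each side is only ever compared with its own earlier value, which is what lets the source and destination use unrelated checksum or etag schemes.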
> ask: abfs connector to support checksum
> ---------------------------------------
>
> Key: HADOOP-18672
> URL: https://issues.apache.org/jira/browse/HADOOP-18672
> Project: Hadoop Common
> Issue Type: Wish
> Components: fs/azure
> Reporter: Wei-Hsiang Lin
> Priority: Major
>
> Hi Hadoop-Azure community,
> I cannot find much information on why the abfs connector does not support
> file-level checksums. Could you share some insights on why it isn't
> supported, and whether there is a plan to support it in the future?
> Having this would be helpful for migrating data from on-prem to Azure storage
> using the abfs connector.
> ref https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html