[
https://issues.apache.org/jira/browse/HADOOP-13282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152713#comment-16152713
]
Steve Loughran commented on HADOOP-13282:
-----------------------------------------
I plan to add this so that a distcp successor can record the etags of
uploaded files and so support incremental uploads.
It might also be possible to cache this value in the S3A input stream the
first time a file is opened, and check it on every GET call initiated during a
seek. This would help detect inconsistencies during a read: either the file is
overwritten with a newer version while it is being read, or an older copy of
the file is suddenly served.
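
A rough sketch of that check, to make the idea concrete. This is not S3A code:
{{EtagChangeDetector}} and its methods are hypothetical names, and the sketch
assumes the stream can see the etag of every GET response it triggers.

```java
import java.io.IOException;
import java.util.Objects;

/**
 * Hypothetical helper: remembers the etag seen on the first GET of a
 * stream and rejects any later GET that returns a different one.
 */
final class EtagChangeDetector {
    private final String path;
    private String expectedEtag; // null until the first GET response is seen

    EtagChangeDetector(String path) {
        this.path = Objects.requireNonNull(path, "path");
    }

    /**
     * Call with the etag of each GET response issued for this stream.
     *
     * @throws IOException if the etag differs from the first one recorded
     */
    void onGetResponse(String etag) throws IOException {
        Objects.requireNonNull(etag, "etag");
        if (expectedEtag == null) {
            expectedEtag = etag; // first GET: record the value
            return;
        }
        if (!expectedEtag.equals(etag)) {
            // Either an overwrite or an older copy being served.
            throw new IOException("File changed during read: " + path
                + " expected etag " + expectedEtag + " but got " + etag);
        }
    }

    String expectedEtag() {
        return expectedEtag;
    }
}
```

The stream would call {{onGetResponse()}} after the initial open and after
every reopen caused by a seek. Failing the read is only one possible policy;
logging and continuing would be another.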
For maximum efficiency we'd have to cache the etags in S3Guard, though that
isn't strictly necessary: you get the etag on the first GET in a stream read,
and for {{getFileChecksum()}} callers can take the performance hit. It will be
no worse than a HEAD call today, and it is used in fewer code paths (we can
infer that from the fact that it doesn't do anything today and nobody has
complained).
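
For illustration, here is roughly what an etag-backed checksum and the
resulting "skip this file?" comparison might look like. It is a standalone
sketch: {{EtagChecksum}}, the algorithm name "etag" and {{inSync()}} are all
assumptions, loosely shaped like Hadoop's {{FileChecksum}} rather than
extending it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/**
 * Hypothetical etag-backed checksum. Shaped like Hadoop's FileChecksum
 * (algorithm name, length, bytes) but deliberately standalone.
 */
final class EtagChecksum {
    /** Assumed algorithm name; not something the store defines. */
    static final String ALGORITHM = "etag";

    private final byte[] bytes;

    EtagChecksum(String etag) {
        // An HTTP ETag header value is quoted; drop the quotes so that
        // quoted and unquoted forms of the same etag compare as equal.
        String stripped = etag.replace("\"", "");
        this.bytes = stripped.getBytes(StandardCharsets.UTF_8);
    }

    String getAlgorithmName() {
        return ALGORITHM;
    }

    int getLength() {
        return bytes.length;
    }

    byte[] getBytes() {
        return bytes.clone(); // defensive copy
    }

    /**
     * True only if both checksums exist, use the same algorithm and have
     * identical bytes. A missing checksum is treated as "not in sync".
     */
    static boolean inSync(EtagChecksum source, EtagChecksum dest) {
        return source != null && dest != null
            && source.getAlgorithmName().equals(dest.getAlgorithmName())
            && Arrays.equals(source.bytes, dest.bytes);
    }
}
```

A copy tool could skip a file when {{inSync()}} returns true and copy it
otherwise. The comparison is only meaningful when both ends generate etags
the same way, so the encryption and multipart quirks still apply.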
> S3 blob etags to be made visible in status/getFileChecksum() calls
> ------------------------------------------------------------------
>
> Key: HADOOP-13282
> URL: https://issues.apache.org/jira/browse/HADOOP-13282
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Priority: Minor
>
> If the etags of blobs were exported via {{getFileChecksum()}}, it'd be
> possible to probe for a blob being in sync with a local file. Distcp could
> use this to decide whether to skip a file or not.
> Now, there's a problem there: distcp needs source and dest filesystems to
> implement the same algorithm. It'd only work out of the box if you were copying
> between S3 instances. There are also quirks with encryption and multipart:
> [s3 docs|http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html].
> At the very least, it's something which could be used when indexing the FS,
> to check for changes later.