[
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-13786:
------------------------------------
Attachment: HADOOP-13786-HADOOP-13345-002.patch
Patch 002
This patch
# defines the mechanism for using multipart uploads for commits
# implements it
# adds tests
# has tests working, including a scale one.
It does not yet wire this up to the MRv1/v2 output committers & FileOutputFormat,
though changes have been made to the MRv2 code to make it possible to place
the S3A committer behind FileOutputFormat.
What works, then? Dela
# create a file with a path {{/example/__pending/job1/task1/part-001.bin}}
# this will initiate an MPU to {{/example/part-001.bin}}, to which all the
output goes.
# when the output stream is closed, the file
{{/example/__pending/job1/task1/part-001.bin.pending}} is created, which saves
everything needed to commit the job
# later a {{FileCommitActions}} class can be created.
# {{FileCommitActions.commitAllPendingFilesInPath()}} will scan a dir for
.pending entries and commit them one by one, here
{{commitAllPendingFilesInPath("/example/__pending/job1/task1/")}}
# ...which causes the file {{/example/part-001.bin}} to come into existence.
# or, if you call {{abortAllPendingFilesInPath(...)}}, the .pending entries are
read and their MPUs aborted.
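To make the sequence above concrete, here is a minimal in-memory sketch of the flow. It is not the code in the patch: the class {{PendingCommitSketch}}, the method {{writeAndClose()}}, the upload IDs and the two maps standing in for S3 are all hypothetical, and the path rule (parent of {{__pending}} plus the unchanged file name) is an assumption read off the example paths. Only the names {{commitAllPendingFilesInPath}} and {{abortAllPendingFilesInPath}} come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the delayed-commit flow; NOT the S3A implementation. */
public class PendingCommitSketch {
    static final String PENDING_DIR = "/__pending/";
    static final String PENDING_SUFFIX = ".pending";

    // Stand-in for the visible object store: path -> contents.
    final Map<String, String> store = new TreeMap<>();
    // Stand-in for multipart uploads in progress: upload id -> {destination, data}.
    final Map<String, String[]> uploads = new TreeMap<>();
    private int nextUploadId = 1;

    /** Assumed rule: the destination is the parent of __pending plus the file name. */
    static String finalDestination(String pendingPath) {
        int marker = pendingPath.indexOf(PENDING_DIR);
        if (marker < 0) {
            throw new IllegalArgumentException("not under __pending: " + pendingPath);
        }
        String base = pendingPath.substring(0, marker);
        String name = pendingPath.substring(pendingPath.lastIndexOf('/') + 1);
        return base + "/" + name;
    }

    /** Start an upload to the final destination, then record it in a .pending entry. */
    String writeAndClose(String pendingPath, String data) {
        String uploadId = "upload-" + nextUploadId++;
        uploads.put(uploadId, new String[] {finalDestination(pendingPath), data});
        String pendingFile = pendingPath + PENDING_SUFFIX;
        store.put(pendingFile, uploadId); // the data itself is not visible yet
        return pendingFile;
    }

    /** Snapshot of the .pending entries under dir, so callers can modify the store. */
    private List<String> pendingFilesUnder(String dir) {
        List<String> found = new ArrayList<>();
        for (String path : store.keySet()) {
            if (path.startsWith(dir) && path.endsWith(PENDING_SUFFIX)) {
                found.add(path);
            }
        }
        return found;
    }

    /** Complete every pending upload under dir, making the destinations visible. */
    List<String> commitAllPendingFilesInPath(String dir) {
        List<String> committed = new ArrayList<>();
        for (String pendingFile : pendingFilesUnder(dir)) {
            String[] upload = uploads.remove(store.remove(pendingFile));
            store.put(upload[0], upload[1]);
            committed.add(upload[0]);
        }
        return committed;
    }

    /** Discard every pending upload under dir; nothing becomes visible. */
    int abortAllPendingFilesInPath(String dir) {
        List<String> pending = pendingFilesUnder(dir);
        for (String pendingFile : pending) {
            uploads.remove(store.remove(pendingFile));
        }
        return pending.size();
    }

    public static void main(String[] args) {
        PendingCommitSketch fs = new PendingCommitSketch();
        fs.writeAndClose("/example/__pending/job1/task1/part-001.bin", "task output");
        System.out.println(fs.store.containsKey("/example/part-001.bin")); // false
        System.out.println(fs.commitAllPendingFilesInPath("/example/__pending/job1/task1/"));
        System.out.println(fs.store.get("/example/part-001.bin")); // task output
    }
}
```

The point of the sketch is that the data never lands under {{__pending}}: only the small .pending record does, so committing or aborting a task touches one record per file and needs no rename.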
Performance? < 1s to commit a single 128MB file over a long-haul link.
{code}
Duration of time to commit
s3a://hwdev-steve-frankfurt-new/tests3ascale/scale/commit/__pending/job_001/commit.bin.pending:
688,701,514 nS
{code}
I think that's pretty good :)
> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> ------------------------------------------------------------------------
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch,
> HADOOP-13786-HADOOP-13345-002.patch
>
>
> A goal of this code is "support O(1) commits to S3 repositories in the
> presence of failures". Implement it, including whatever is needed to
> demonstrate the correctness of the algorithm. (that is, assuming that s3guard
> provides a consistent view of the presence/absence of blobs, show that we can
> commit directly).
> I consider ourselves free to expose the blobstore-ness of the s3 output
> streams (ie. not visible until the close()), if we need to use that to allow
> us to abort commit operations.