[
https://issues.apache.org/jira/browse/HADOOP-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-14766:
------------------------------------
Attachment: HADOOP-14766-001.patch
Patch 001
This is the initial PoC imported into Hadoop under hadoop-common; it eliminates
the copy & paste of ContractTestUtils.NanoTime by moving that class and then
retaining the old one as a subclass of the moved one.
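Roughly, the retention pattern looks like the following; the package and class
names in the sketch are assumptions, not necessarily the exact shape of the
patch:
{code:java}
// Assumed destination: a shared util class usable outside test code.
package org.apache.hadoop.util;

public class NanoTimer {
  private final long startTime = System.nanoTime();
  private long endTime;

  /** Stop the timer and return the elapsed duration in nanoseconds. */
  public long end() {
    endTime = System.nanoTime();
    return duration();
  }

  public long duration() {
    return endTime - startTime;
  }
}
{code}
{code:java}
// The old nested class stays behind as a thin, deprecated subclass, so
// existing test code compiles unchanged while the implementation lives in
// one place (package/class names assumed).
package org.apache.hadoop.fs.contract;

public class ContractTestUtils {
  @Deprecated
  public static class NanoTimer extends org.apache.hadoop.util.NanoTimer {
  }
}
{code}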
I'm not 100% sure this is the right home, but we don't yet have an explicit
cloud module.
Note: this also works with HDFS, and even across the local FS; any filesystem
which implements its own version of {{copyFromLocalFile}} will benefit from it.
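For reference, the call everything hinges on is the stock
{{FileSystem.copyFromLocalFile()}}; a minimal usage example (the paths here are
made up):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("file:///tmp/data/part-0000.csv");           // hypothetical local file
    Path dest = new Path("s3a://example-bucket/data/part-0000.csv"); // hypothetical destination

    // Dynamic dispatch means a store's own copyFromLocalFile override
    // (if it has one) is what actually runs here.
    FileSystem destFs = FileSystem.get(dest.toUri(), conf);
    destFs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, src, dest);
  }
}
{code}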
Testing: manual only, against S3A and its copyFromLocalFile.
There's no check for changed files, i.e. against checksums, timestamps or
similar, and none is planned. This is primarily a local-to-store upload program
with speed comparable to the one shipped with the AWS SDK, but able to work
with any remote HCFS store; it is not an incremental backup mechanism. Though
if someone were to issue getChecksum(path) across all the stores, it'd be good
to log that, possibly even export a minimal Avro file summary.
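If that were ever done, I'd expect it to be little more than a recursive
listing plus {{FileSystem.getFileChecksum()}}; a sketch (not part of this
patch; stores without checksum support return null):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch only: log a checksum for every file under a directory. */
public class ChecksumLogger {

  public static void logChecksums(FileSystem fs, Path dir) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDirectory()) {
        logChecksums(fs, status.getPath());
      } else {
        // Stores without checksum support (object stores, usually) return null.
        FileChecksum checksum = fs.getFileChecksum(status.getPath());
        System.out.println(status.getPath() + "\t"
            + (checksum == null ? "(none)" : checksum.toString()));
      }
    }
  }
}
{code}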
> Cloudup: an object store high performance dfs put command
> ---------------------------------------------------------
>
> Key: HADOOP-14766
> URL: https://issues.apache.org/jira/browse/HADOOP-14766
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, fs/s3
> Affects Versions: 2.8.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: HADOOP-14766-001.patch
>
>
> {{hdfs put local s3a://path}} is suboptimal as it treewalks down the
> source tree and then, sequentially, copies each file up: it opens the source
> as a stream, copies its contents into a buffer, writes that buffer to the
> dest file, and repeats (roughly the generic stream copy sketched after the
> list below).
> For S3A that hurts because
> * it's doing the upload inefficiently: the file can be uploaded just by
> handing the pathname to the AWS transfer manager
> * it is doing it sequentially, when a parallelised upload would be faster.
> * as the ordering of the files to upload is a recursive treewalk, it doesn't
> spread the upload across multiple shards.
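The copy path being criticised is essentially the generic per-file stream copy,
roughly:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Rough equivalent of the generic per-file copy: stream in, buffer, stream out. */
public class SequentialCopySketch {

  static void copyOne(FileSystem srcFs, Path src,
      FileSystem destFs, Path dest) throws IOException {
    try (InputStream in = srcFs.open(src);
         OutputStream out = destFs.create(dest, true)) {
      // Buffer-by-buffer copy; no multipart upload, no parallelism, and the
      // destination store never sees the source file's path, only its bytes.
      IOUtils.copyBytes(in, out, 4096);
    }
  }
}
{code}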
> Better (see the sketch at the end):
> * build the list of files to upload
> * upload in parallel, picking entries from the list at random and spreading
> across a pool of uploaders
> * upload straight from the local file via {{copyFromLocalFile()}}
> * track IO load (files created/second) to estimate risk of throttling.
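A minimal sketch of that plan, using a fixed thread pool and only the public
FileSystem API; the pool size, paths and destination layout are placeholders,
and the throttling/IO-load tracking is left out:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ParallelUploadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path srcDir = new Path("file:///data/to/upload");           // hypothetical source
    Path destDir = new Path("s3a://example-bucket/backups/");   // hypothetical destination
    FileSystem localFs = FileSystem.getLocal(conf);
    FileSystem destFs = FileSystem.get(destDir.toUri(), conf);

    // 1. build the list of files to upload
    List<Path> files = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = localFs.listFiles(srcDir, true);
    while (it.hasNext()) {
      files.add(it.next().getPath());
    }

    // 2. shuffle so uploads don't hit the store in treewalk order
    Collections.shuffle(files);

    // 3. upload in parallel, straight from the local file
    ExecutorService pool = Executors.newFixedThreadPool(16);    // placeholder pool size
    List<Future<?>> outcomes = new ArrayList<>();
    for (Path src : files) {
      Path dest = new Path(destDir, src.getName());  // flattens names; a real tool keeps the tree
      outcomes.add(pool.submit(() -> {
        destFs.copyFromLocalFile(false, true, src, dest);
        return null;
      }));
    }
    for (Future<?> outcome : outcomes) {
      outcome.get();   // propagate any upload failure
    }
    pool.shutdown();
  }
}
{code}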