[
https://issues.apache.org/jira/browse/HADOOP-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-14766:
------------------------------------
Attachment: HADOOP-14766-001.patch
Patch 001
This is the initial PoC imported into Hadoop under hadoop-common; it eliminates
the copy & paste of ContractTestUtils.NanoTime by moving that class and then
retaining the old one as a subclass of the moved one.
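Roughly, the retention pattern looks like the following; the package and class
names in the sketch are assumptions, not necessarily the exact shape of the
patch:
{code:java}
// Assumed destination: a shared util class usable outside test code.
package org.apache.hadoop.util;

public class NanoTimer {
  private final long startTime = System.nanoTime();
  private long endTime;

  /** Stop the timer and return the elapsed duration in nanoseconds. */
  public long end() {
    endTime = System.nanoTime();
    return duration();
  }

  public long duration() {
    return endTime - startTime;
  }
}
{code}
{code:java}
// The old nested class stays behind as a thin, deprecated subclass, so
// existing test code compiles unchanged while the implementation lives in
// one place (package/class names assumed).
package org.apache.hadoop.fs.contract;

public class ContractTestUtils {
  @Deprecated
  public static class NanoTimer extends org.apache.hadoop.util.NanoTimer {
  }
}
{code}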
I'm not 100% sure this is the right home, but we don't yet have an explicit
cloud module.
Note: this also works with HDFS, and even across the local FS; any filesystem
which implements its own version of {{copyFromLocalFile}} will benefit from it.
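For reference, the call everything hinges on is the stock
{{FileSystem.copyFromLocalFile()}}; a minimal usage example (the paths here are
made up):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("file:///tmp/data/part-0000.csv");           // hypothetical local file
    Path dest = new Path("s3a://example-bucket/data/part-0000.csv"); // hypothetical destination

    // Dynamic dispatch means a store's own copyFromLocalFile override
    // (if it has one) is what actually runs here.
    FileSystem destFs = FileSystem.get(dest.toUri(), conf);
    destFs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, src, dest);
  }
}
{code}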
Testing: manual only, against S3A and its copyFromLocalFile.
There's no check for changed files, i.e. against checksums, timestamps or
similar, and none is planned. This is primarily a local-to-store upload program
with speed comparable to the one shipped with the AWS SDK, but able to work
with any remote HCFS store; it is not an incremental backup mechanism. Though
if someone were to issue getChecksum(path) across all the stores, it'd be good
to log that, possibly even export a minimal Avro file summary.
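If that were ever done, I'd expect it to be little more than a recursive
listing plus {{FileSystem.getFileChecksum()}}; a sketch (not part of this
patch; stores without checksum support return null):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch only: log a checksum for every file under a directory. */
public class ChecksumLogger {

  public static void logChecksums(FileSystem fs, Path dir) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDirectory()) {
        logChecksums(fs, status.getPath());
      } else {
        // Stores without checksum support (object stores, usually) return null.
        FileChecksum checksum = fs.getFileChecksum(status.getPath());
        System.out.println(status.getPath() + "\t"
            + (checksum == null ? "(none)" : checksum.toString()));
      }
    }
  }
}
{code}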
> Cloudup: an object store high performance dfs put command
> ---------------------------------------------------------
>
> Key: HADOOP-14766
> URL: https://issues.apache.org/jira/browse/HADOOP-14766
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, fs/s3
> Affects Versions: 2.8.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Attachments: HADOOP-14766-001.patch
>
>
> {{hdfs put local s3a://path}} is suboptimal as it treewalks down the
> source tree and then, sequentially, copies each file up: it opens the source
> as a stream, copies its contents into a buffer, writes that buffer to the
> dest file, and repeats (roughly the generic stream copy sketched after the
> list below).
> For S3A that hurts because
> * it's doing the upload inefficiently: the file can be uploaded just by
> handing the pathname to the AWS transfer manager
> * it is doing it sequentially, when a parallelised upload would be faster.
> * as the ordering of the files to upload is a recursive treewalk, it doesn't
> spread the upload across multiple shards.
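The copy path being criticised is essentially the generic per-file stream copy,
roughly:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Rough equivalent of the generic per-file copy: stream in, buffer, stream out. */
public class SequentialCopySketch {

  static void copyOne(FileSystem srcFs, Path src,
      FileSystem destFs, Path dest) throws IOException {
    try (InputStream in = srcFs.open(src);
         OutputStream out = destFs.create(dest, true)) {
      // Buffer-by-buffer copy; no multipart upload, no parallelism, and the
      // destination store never sees the source file's path, only its bytes.
      IOUtils.copyBytes(in, out, 4096);
    }
  }
}
{code}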
> Better (see the sketch at the end):
> * build the list of files to upload
> * upload in parallel, picking entries from the list at random and spreading
> across a pool of uploaders
> * upload straight from the local file via {{copyFromLocalFile()}}
> * track IO load (files created/second) to estimate risk of throttling.
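A minimal sketch of that plan, using a fixed thread pool and only the public
FileSystem API; the pool size, paths and destination layout are placeholders,
and the throttling/IO-load tracking is left out:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ParallelUploadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path srcDir = new Path("file:///data/to/upload");           // hypothetical source
    Path destDir = new Path("s3a://example-bucket/backups/");   // hypothetical destination
    FileSystem localFs = FileSystem.getLocal(conf);
    FileSystem destFs = FileSystem.get(destDir.toUri(), conf);

    // 1. build the list of files to upload
    List<Path> files = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = localFs.listFiles(srcDir, true);
    while (it.hasNext()) {
      files.add(it.next().getPath());
    }

    // 2. shuffle so uploads don't hit the store in treewalk order
    Collections.shuffle(files);

    // 3. upload in parallel, straight from the local file
    ExecutorService pool = Executors.newFixedThreadPool(16);    // placeholder pool size
    List<Future<?>> outcomes = new ArrayList<>();
    for (Path src : files) {
      Path dest = new Path(destDir, src.getName());  // flattens names; a real tool keeps the tree
      outcomes.add(pool.submit(() -> {
        destFs.copyFromLocalFile(false, true, src, dest);
        return null;
      }));
    }
    for (Future<?> outcome : outcomes) {
      outcome.get();   // propagate any upload failure
    }
    pool.shutdown();
  }
}
{code}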