[
https://issues.apache.org/jira/browse/HADOOP-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671868#comment-15671868
]
ASF GitHub Bot commented on HADOOP-13655:
-----------------------------------------
Github user liuml07 commented on a diff in the pull request:
https://github.com/apache/hadoop/pull/131#discussion_r88342205
--- Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm ---
@@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources
The SSL configuration file must be in the class-path of the DistCp
program.
+$H3 DistCp and Object Stores
+
+DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
+
+Prerequisites
+
+1. The JAR containing the object store implementation is on the classpath,
+along with all of its dependencies.
+1. Unless the JAR automatically registers its bundled filesystem clients,
+the configuration may need to be modified to state the class which
+implements the filesystem schema. All of the ASF's own object store clients
+are self-registering.
+1. The relevant object store access credentials must be available in the cluster
+configuration, or be otherwise available in all cluster hosts.
+
+DistCp can be used to upload data
+
+```bash
+hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
+```
+
+To download data
+
+```bash
+hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
+```
+
+To copy data between object stores
+
+```bash
+hadoop distcp s3a://bucket/generated/results \
+ wasb://[email protected]
+```
+
+And to copy data within an object store
+
+```bash
+hadoop distcp wasb://[email protected]/current \
+ wasb://[email protected]/old
+```
+
+And to use `-update` to only copy changed files.
+
+```bash
+hadoop distcp -update -numListstatusThreads 20 \
+ swift://history.cluster1/2016 \
+ hdfs://nn1:8020/history/2016
+```
+
+Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
+on a large directory tree (the limit is 40 threads).
+
+When `DistCp -update` is used with object stores,
+generally only the modification time and length of the individual files are compared,
+not any checksums. The fact that most object stores do have valid timestamps
+for directories is irrelevant; only the file timestamps are compared.
+However, it is important to have the clock of the client computers close
+to that of the infrastructure, so that timestamps are consistent between
+the client/HDFS cluster and that of the object store. Otherwise, changed files may be
+missed/copied too often.
+
+**Notes**
+
+* The `-atomic` option causes a rename of the temporary data, so significantly
+increases the time to commit work at the end of the operation. Furthermore,
+as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories
+the `-atomic` operation doesn't actually deliver what is promised. *Avoid*.
+
+* The `-append` option is not supported.
+
+* The `-diff` option is not supported
--- End diff --
The `-diff`/`-rdiff` options are not supported
Yes, there is an `-rdiff` option that was just added.
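
As a rough illustration of the `-numListstatusThreads` advice in the quoted diff, a wrapper script could cap the requested thread count at the stated limit of 40 before building the command. This is only a sketch: the helper names `clamp_threads` and `build_update_cmd` are hypothetical and not part of DistCp, and the script prints the command rather than running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: cap a requested listing-thread count at 40,
# the limit stated in the quoted documentation, with a floor of 1.
clamp_threads() {
  local requested="$1"
  local max=40
  if [ "$requested" -gt "$max" ]; then
    echo "$max"
  elif [ "$requested" -lt 1 ]; then
    echo 1
  else
    echo "$requested"
  fi
}

# Hypothetical helper: print (not run) a `distcp -update` command
# using the clamped thread count, a source URI and a destination URI.
build_update_cmd() {
  local threads
  threads="$(clamp_threads "$1")"
  echo "hadoop distcp -update -numListstatusThreads ${threads} $2 $3"
}

# Paths taken from the example in the quoted diff.
build_update_cmd 64 swift://history.cluster1/2016 hdfs://nn1:8020/history/2016
```

Printing the command first makes it easy to review the effective thread count before starting a long copy.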
> document object store use with fs shell and distcp
> --------------------------------------------------
>
> Key: HADOOP-13655
> URL: https://issues.apache.org/jira/browse/HADOOP-13655
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: documentation, fs, fs/s3
> Affects Versions: 2.7.3
> Reporter: Steve Loughran
> Assignee: Steve Loughran
>
> There's no specific docs for working with object stores from the {{hadoop
> fs}} shell or in distcp; people either suffer from this (performance,
> billing), or learn through trial and error what to do.
> Add a section in both fs shell and distcp docs covering use with object
> stores.