[
https://issues.apache.org/jira/browse/HADOOP-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671867#comment-15671867
]
ASF GitHub Bot commented on HADOOP-13655:
-----------------------------------------
Github user liuml07 commented on a diff in the pull request:
https://github.com/apache/hadoop/pull/131#discussion_r88343643
--- Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm ---
@@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources
The SSL configuration file must be in the class-path of the DistCp
program.
+$H3 DistCp and Object Stores
+
+DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
+
+Prerequisites
+
+1. The JAR containing the object store implementation is on the classpath,
+along with all of its dependencies.
+1. Unless the JAR automatically registers its bundled filesystem clients,
+the configuration may need to be modified to state the class which
+implements the filesystem scheme. All of the ASF's own object store clients
+are self-registering.
+1. The relevant object store access credentials must be available in the cluster
+configuration, or be otherwise available in all cluster hosts.
+
+DistCp can be used to upload data
+
+```bash
+hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
+```
+
+To download data
+
+```bash
+hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
+```
+
+To copy data between object stores
+
+```bash
+hadoop distcp s3a://bucket/generated/results \
+ wasb://[email protected]
+```
+
+And to copy data within an object store
+
+```bash
+hadoop distcp wasb://[email protected]/current \
+ wasb://[email protected]/old
+```
+
+And to use `-update` to only copy changed files.
+
+```bash
+hadoop distcp -update -numListstatusThreads 20 \
+ swift://history.cluster1/2016 \
+ hdfs://nn1:8020/history/2016
+```
+
+Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
+on a large directory tree (the limit is 40 threads).
+
+When `DistCp -update` is used with object stores,
+generally only the modification time and length of the individual files are compared,
+not any checksums. The fact that most object stores do have valid timestamps
+for directories is irrelevant; only the file timestamps are compared.
+However, it is important to have the clock of the client computers close
+to that of the infrastructure, so that timestamps are consistent between
+the client/HDFS cluster and that of the object store. Otherwise, changed files may be
+missed or copied too often.
+
+**Notes**
+
+* The `-atomic` option causes a rename of the temporary data, so significantly
+increases the time to commit work at the end of the operation. Furthermore,
+as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories,
+the `-atomic` operation doesn't actually deliver what is promised. *Avoid*.
+
+* The `-append` option is not supported.
+
+* The `-diff` option is not supported.
+
+* CRC checking will not be performed, irrespective of the value of the `-skipCrc`
+flag.
+
+* All `-p` options, including those to preserve permissions, user and group information, attributes,
+checksums and replication are generally ignored. The `wasb://` connector will
+preserve the information, but not enforce the permissions.
+
+* Some object store connectors offer an option for in-memory buffering of
+output, for example the S3A connector. Using such an option while copying
+large files may trigger some form of out-of-memory event,
+be it a heap overflow or a YARN container termination.
+This is particularly common if the network bandwidth
+between the cluster and the object store is limited (such as when working
+with remote object stores). It is best to disable/avoid such options and
+rely on disk buffering.
+
+* Copy operations within a single object store still take place in the Hadoop cluster,
+even when the object store implements a more efficient COPY operation internally.
+
+ That is, an operation such as
--- End diff --
The indentation is unnecessary?
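
For readers following the thread: the `-update` behaviour described in the patch (file length and modification time compared, checksums not) can be pictured with the small shell sketch below. It is not DistCp's code. The function name `needs_copy`, its arguments, and the rule that a newer source timestamp triggers a copy are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Illustrative only: NOT DistCp's implementation.
# Usage: needs_copy SRC_LEN SRC_MTIME DST_LEN DST_MTIME
# Lengths are in bytes, mtimes in seconds since the epoch.
# Prints "copy" or "skip".
needs_copy() {
  local src_len=$1 src_mtime=$2 dst_len=$3 dst_mtime=$4
  # A different length always means the file has changed.
  if [ "$src_len" -ne "$dst_len" ]; then
    echo copy
  # Assumption: a source newer than the destination counts as changed.
  elif [ "$src_mtime" -gt "$dst_mtime" ]; then
    echo copy
  else
    echo skip
  fi
}
```

Under these assumptions, `needs_copy 1024 1000 1024 1005` prints `skip`: if the destination's clock runs a few seconds ahead, a same-length file rewritten at the source is not picked up, which is the clock-skew concern raised in the patch text.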
> document object store use with fs shell and distcp
> --------------------------------------------------
>
> Key: HADOOP-13655
> URL: https://issues.apache.org/jira/browse/HADOOP-13655
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: documentation, fs, fs/s3
> Affects Versions: 2.7.3
> Reporter: Steve Loughran
> Assignee: Steve Loughran
>
> There are no specific docs for working with object stores from the {{hadoop
> fs}} shell or in distcp; people either suffer from this (performance,
> billing), or learn through trial and error what to do.
> Add a section in both fs shell and distcp docs covering use with object
> stores.