[ 
https://issues.apache.org/jira/browse/HADOOP-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671867#comment-15671867
 ] 

ASF GitHub Bot commented on HADOOP-13655:
-----------------------------------------

Github user liuml07 commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/131#discussion_r88343643
  
    --- Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm ---
    @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources
     
       The SSL configuration file must be in the class-path of the DistCp program.
     
    +$H3 DistCp and Object Stores
    +
    +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
    +
    +Prerequisites
    +
    +1. The JAR containing the object store implementation is on the classpath,
    +along with all of its dependencies.
    +1. Unless the JAR automatically registers its bundled filesystem clients,
    +the configuration may need to be modified to state the class which
    +implements the filesystem schema. All of the ASF's own object store clients
    +are self-registering.
    +1. The relevant object store access credentials must be available in the cluster
    +configuration, or be otherwise available in all cluster hosts.
    +
    +DistCp can be used to upload data
    +
    +```bash
    +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
    +```
    +
    +To download data
    +
    +```bash
    +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
    +```
    +
    +To copy data between object stores
    +
    +```bash
    +hadoop distcp s3a://bucket/generated/results \
    +  wasb://[email protected]
    +```
    +
    +And to copy data within an object store
    +
    +```bash
    +hadoop distcp wasb://[email protected]/current \
    +  wasb://[email protected]/old
    +```
    +
    +And to use `-update` to only copy changed files.
    +
    +```bash
    +hadoop distcp -update -numListstatusThreads 20  \
    +  swift://history.cluster1/2016 \
    +  hdfs://nn1:8020/history/2016
    +```
    +
    +Because object stores are slow to list files, consider setting the `-numListstatusThreads` option
    +when performing a `-update` operation
    +on a large directory tree (the limit is 40 threads).
    +
    +When `DistCp -update` is used with object stores,
    +generally only the modification time and length of the individual files are compared,
    +not any checksums. The fact that most object stores do have valid timestamps
    +for directories is irrelevant; only the file timestamps are compared.
    +However, it is important to have the clock of the client computers close
    +to that of the infrastructure, so that timestamps are consistent between
    +the client/HDFS cluster and that of the object store. Otherwise, changed files may be
    +missed or copied too often.
    +
    +**Notes**
    +
    +* The `-atomic` option causes a rename of the temporary data, so significantly
    +increases the time to commit work at the end of the operation. Furthermore,
    +as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories,
    +the `-atomic` operation doesn't actually deliver what is promised. *Avoid*.
    +
    +* The `-append` option is not supported.
    +
    +* The `-diff` option is not supported.
    + 
    +* CRC checking will not be performed, irrespective of the value of the `-skipCrc`
    +flag.
    +
    +* All `-p` options, including those to preserve permissions, user and group information, attributes,
    +checksums and replication are generally ignored. The `wasb://` connector will
    +preserve the information, but not enforce the permissions.
    +
    +* Some object store connectors offer an option for in-memory buffering of
    +output, for example the S3A connector. Using such an option while copying
    +large files may trigger some form of out of memory event,
    +be it a heap overflow or a YARN container termination.
    +This is particularly common if the network bandwidth
    +between the cluster and the object store is limited (such as when working
    +with remote object stores). It is best to disable/avoid such options and
    +rely on disk buffering.
    +
    +* Copy operations within a single object store still take place in the Hadoop cluster,
    +even when the object store implements a more efficient COPY operation internally.
    +
    +    That is, an operation such as
    --- End diff --
    
    The indentation is unnecessary?


> document object store use with fs shell and distcp
> --------------------------------------------------
>
>                 Key: HADOOP-13655
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13655
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: documentation, fs, fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> There are no specific docs for working with object stores from the {{hadoop fs}}
> shell or in distcp; people either suffer from this (performance, billing),
> or learn through trial and error what to do.
> Add a section in both fs shell and distcp docs covering use with object stores.



