[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825535#comment-15825535
 ] 

Yongjun Zhang commented on HADOOP-13114:
----------------------------------------

Thanks [~raviprak] for the patch and all for the discussion here.

One possible use of compressing data only at write time is to save disk space 
on the target side, for example when the target is a backup cluster that needs 
to conserve space. 

Yes, this could also be implemented with a tool that compresses the data after 
distcp, but then the target needs to store both the original files and the 
compressed files until the originals are deleted.
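
To make the space argument concrete, here is a minimal sketch in plain Python. 
It is not DistCp code: the function names are hypothetical, local files stand 
in for HDFS, and bz2 is chosen only because the issue proposes BZip2Codec as 
the default. It compares the peak space used on the target by the two 
approaches.

```python
import bz2
import os
import shutil
import tempfile


def copy_then_compress(src, target_dir):
    """Copy first, compress afterwards (hypothetical stand-in for distcp + a separate tool)."""
    raw = os.path.join(target_dir, os.path.basename(src))
    shutil.copyfile(src, raw)
    compressed = raw + ".bz2"
    with open(raw, "rb") as fin, bz2.open(compressed, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # Peak usage: the uncompressed copy and the compressed file coexist here.
    peak = os.path.getsize(raw) + os.path.getsize(compressed)
    os.remove(raw)
    return compressed, peak


def compress_on_write(src, target_dir):
    """Compress while writing: only the compressed file ever lands on the target."""
    compressed = os.path.join(target_dir, os.path.basename(src) + ".bz2")
    with open(src, "rb") as fin, bz2.open(compressed, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return compressed, os.path.getsize(compressed)
```

In this sketch the first approach needs extra room on the target equal to the 
full uncompressed file size until the original copy is removed, which is the 
overhead that compressing at write time avoids.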

I have some thoughts about HADOOP-8065 and will post them there shortly.

Thanks.


> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114.05.patch, HADOOP-13114.06.patch, 
> HADOOP-13114-trunk_2016-05-07-1.patch, HADOOP-13114-trunk_2016-05-08-1.patch, 
> HADOOP-13114-trunk_2016-05-10-1.patch, HADOOP-13114-trunk_2016-05-12-1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified 
> compression format. This avoids the extra hop of compressing data after the 
> transfer. Backup strategies that copy to a different cluster also benefit by 
> saving one I/O round trip to and from HDFS, which saves resources, time and 
> effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's 
> default extension to indicate that the file is compressed, so users can tell 
> which codec was used to compress the data.
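
The suffixing rule in the last bullet of the description could look roughly 
like the sketch below. This is plain Python for illustration, not the patch 
itself: `target_name` and the lookup table are hypothetical, and the 
extensions listed are the ones I understand these Hadoop codecs to report as 
their default extension.

```python
# Codec named in the issue description as the proposed default.
DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"

# Assumed codec-class-to-extension table (illustrative, not exhaustive).
CODEC_EXTENSIONS = {
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.DefaultCodec": ".deflate",
}


def target_name(path, compress=False, codec=DEFAULT_CODEC):
    """Return the target file name, suffixed with the codec extension if compressing."""
    if not compress:
        # Compression disabled: the name is copied through unchanged.
        return path
    if codec not in CODEC_EXTENSIONS:
        raise ValueError("unknown codec: " + codec)
    return path + CODEC_EXTENSIONS[codec]
```

For example, `target_name("/backup/logs/part-00000", compress=True)` yields a 
name ending in `.bz2`, while passing the GzipCodec class name yields one 
ending in `.gz`.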



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

