[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15815294#comment-15815294
 ] 

Nathan Roberts commented on HADOOP-13114:
-----------------------------------------

Sorry for jumping in late. I tend to agree that this is probably outside the 
scope of distcp. I understand the desire to support this capability, but the 
use-cases get strange if we fold it into distcp itself. It might be as simple 
as creating a new command, "distcompress" or something similar, which could 
share exactly the same codebase as distcp but would only enable this new 
capability in that mode. Some of the worries I have with having it in distcp 
are:
- Just the name bothers me a bit. Copy commands don't normally transform data, 
but this one would. 
- What happens if we run the command with compression twice, e.g. distcp a->b, 
then b->c? I'm assuming c is a compressed version of b, which is a compressed 
version of a. To read c we'd have to unwind both layers of compression. That 
seems strange, and it would be really easy to do by accident.
- I'm assuming CRC checks have to be disabled when doing this. Did we force the 
user to disable CRC checks by providing the necessary option, or did we just do 
it automatically? If automatic, we should WARN them that this happened.
- Obvious question is: "if it's valuable to compress, why wasn't it compressed 
in the first place?" 
  

> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch, 
> HADOOP-13114.06.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified 
> compression format. This avoids the extra hop of compressing data after the 
> transfer. Backup strategies that copy to a different cluster also benefit by 
> saving one IO operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's 
> default extension to indicate that the file is compressed. This lets users see 
> which codec was used to compress the data.
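
The filename-suffix rule in the description can be sketched as follows. This 
is an illustration in plain Python, not code from the attached patches: the 
function name compressed_target_name and the lookup table are hypothetical. 
The two extensions are the default extensions of the two Hadoop codecs named 
above.

```python
# Default file extensions of the two codecs named in the issue description.
CODEC_EXTENSIONS = {
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
}

# The description proposes BZip2Codec as the default for -compressOutput.
DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"


def compressed_target_name(source_name: str, codec: str = DEFAULT_CODEC) -> str:
    """Hypothetical helper: append the codec's default extension to a filename."""
    return source_name + CODEC_EXTENSIONS[codec]


print(compressed_target_name("part-00000"))
print(compressed_target_name(
    "part-00000", "org.apache.hadoop.io.compress.GzipCodec"))
```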



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

