[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819365#comment-15819365
 ] 

Koji Noguchi commented on HADOOP-13114:
---------------------------------------

bq. Could you please elucidate your concern if it's not that?

My point is that this command won't be useful unless the compressed outputs are 
directly readable by Hadoop jobs.
Avro, ORC, RCFile, SequenceFile, and other common file formats all have 
their own ways of compressing, and simply gzip/bzip2-ing the entire files won't 
do any good.
Worse, I don't think the patch provides a way to uncompress them back.

bq.  but that means we'd make assumptions about Hadoop's use cases

And I'd say you're assuming users would call this distcp+compress only on text 
files.
Files in other file formats would become unreadable (until uncompressed back).
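
To illustrate the point above, here is a minimal sketch that uses only java.util.zip. It is not the patch's code and not Hadoop's reader. The class name, the placeholder payload, and the startsWithSeqMagic check are all hypothetical; the only format fact it relies on is that a SequenceFile begins with the bytes "SEQ". Once the whole file is gzipped, a format-aware reader sees the gzip magic bytes instead and cannot proceed until the file is uncompressed again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class WholeFileGzipDemo {

    /** Placeholder bytes: the "SEQ" magic plus filler, NOT a valid SequenceFile. */
    static byte[] fakeSequenceFile() {
        return "SEQ-placeholder-header-and-records".getBytes(StandardCharsets.US_ASCII);
    }

    /** Hypothetical stand-in for the first check a format-aware reader makes. */
    static boolean startsWithSeqMagic(byte[] data) {
        return data.length >= 3 && data[0] == 'S' && data[1] == 'E' && data[2] == 'Q';
    }

    /** Compresses the whole byte array as one gzip stream. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    /** Reverses gzip(): reads the stream to the end and returns the raw bytes. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = fakeSequenceFile();
        byte[] compressed = gzip(original);
        System.out.println(startsWithSeqMagic(original));           // true
        System.out.println(startsWithSeqMagic(compressed));         // false: gzip magic 0x1f 0x8b comes first
        System.out.println(startsWithSeqMagic(gunzip(compressed))); // true again after uncompressing
    }
}
```

The same reasoning applies to any container format whose reader expects its own header at offset zero, which is why compression inside the format (per block or per record) keeps the file readable while whole-file compression does not.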


I agree with Nathan on the naming. If the command is called 
{{dist-text-compress}}, then I'll have no concerns.

> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch, 
> HADOOP-13114.06.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified 
> compression format. This avoids a separate hop to compress the data after 
> transfer. Backup strategies that target a different cluster also benefit from 
> saving one IO operation to and from HDFS, thus saving resources, time, and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's 
> default extension to indicate that the file is compressed, so users know 
> which codec was used to compress the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
