[
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684765#comment-15684765
]
Yongjun Zhang commented on HADOOP-13114:
----------------------------------------
Hi [~snayakm] and [~raviprak], thanks a lot for your earlier work here!
Hi Ravi, I reviewed the latest rev 5 that you posted. Some comments:
1. All items listed in
https://issues.apache.org/jira/browse/HADOOP-8065?focusedCommentId=15668944&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15668944
* use constants instead of hardcoded values.
* use DistCp's own set of configuration keys instead of the FileOutputFormat ones.
This would separate DistCp's config from that of other MapReduce jobs.
* let DistCp fail before reaching the mapper if compression is enabled with
an invalid codec
* add a negative test
I did all of these in the latest patch version in HADOOP-8065.
2. Think about using extended attributes to address
https://issues.apache.org/jira/browse/HADOOP-8065?focusedCommentId=15670862&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15670862
3. Nit: {{private boolean outputCodec = false;}} is a misnomer; it was meant to
be {{compressOutput}}
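To illustrate the fail-fast point in item 1, here is a minimal sketch. It is not code from either patch. The key names {{distcp.compress.output}} and {{distcp.compress.codec}}, the class name, and the {{resolveCodec}} helper are all hypothetical. The sketch uses a plain map instead of Hadoop's Configuration so that it runs without Hadoop on the classpath; in DistCp the required type would presumably be the CompressionCodec interface.

```java
import java.util.HashMap;
import java.util.Map;

final class DistCpCompressionCheck {
    // Hypothetical DistCp-specific keys (assumptions, not the names used in the patch).
    static final String CONF_COMPRESS_OUTPUT = "distcp.compress.output";
    static final String CONF_COMPRESS_CODEC = "distcp.compress.codec";
    // Default codec named in the issue description.
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec";

    /**
     * Returns the codec class name to use, or null when compression is disabled.
     * Throws IllegalArgumentException if the codec class cannot be loaded or
     * does not implement requiredType, so the failure happens before any job runs.
     */
    static String resolveCodec(Map<String, String> conf, Class<?> requiredType) {
        boolean compressOutput =
            Boolean.parseBoolean(conf.getOrDefault(CONF_COMPRESS_OUTPUT, "false"));
        if (!compressOutput) {
            return null;
        }
        String codecName = conf.getOrDefault(CONF_COMPRESS_CODEC, DEFAULT_CODEC);
        Class<?> codecClass;
        try {
            codecClass = Class.forName(codecName);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Invalid compression codec: " + codecName, e);
        }
        if (!requiredType.isAssignableFrom(codecClass)) {
            throw new IllegalArgumentException(
                codecName + " is not a " + requiredType.getName());
        }
        return codecName;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CONF_COMPRESS_OUTPUT, "true");
        conf.put(CONF_COMPRESS_CODEC, "no.such.Codec");
        try {
            resolveCodec(conf, Object.class);
        } catch (IllegalArgumentException e) {
            // The bad codec is rejected up front, which is what a negative test would assert.
            System.out.println("Rejected before job submission: " + e.getMessage());
        }
    }
}
```

The invalid-codec branch is the case a negative test would cover: enable compression, pass a bogus codec name, and expect the failure at option-validation time rather than inside a mapper.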
I think item 2 can be deferred to a separate jira.
What do you think?
Thanks.
> DistCp should have option to compress data on write
> ---------------------------------------------------
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
> Reporter: Suraj Nayak
> Assignee: Suraj Nayak
> Priority: Minor
> Labels: distcp
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch,
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch,
> HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified
> compression format. This avoids the extra hop of compressing data after transfer.
> Backup strategies that target a different cluster also benefit from saving one IO
> operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput defaulting to
> {{org.apache.hadoop.io.compress.BZip2Codec}}.
> * Users will be able to change codec with {{-D
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with default codec
> extension to indicate the file is compressed. Thus users can be aware of what
> codec was used to compress the data.