[
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822651#comment-15822651
]
Joep Rottinghuis commented on HADOOP-13114:
-------------------------------------------
I have concerns similar to the ones already raised: a copy shouldn't change the
format.
It seems that the patch doesn't allow -update and -compress to be used at the
same time. What if the first copy was done with -compress, and the user later
changes their job to drop -compress and use -update instead? That will result
in all files getting copied again, right?
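To make the re-copy concern concrete, here is a minimal Java sketch of an
-update style skip check. It is a simplified model, not DistCp's actual code:
as far as I understand, -update skips a file only when the target already has
a file at the same relative path with the same length (and, unless
-skipcrccheck is given, a matching checksum). The class name, method name,
paths and sizes below are made up for illustration, and the .bz2 suffix follows
the renaming proposed in the issue description.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of an -update style skip check; NOT DistCp's real code. */
class UpdateSkipModel {

    /**
     * Returns true when the source file would have to be copied again.
     * The target listing maps relative path -> file length (assumed model;
     * checksums are left out to keep the sketch short).
     */
    static boolean needsCopy(String relPath, long srcLen, Map<String, Long> target) {
        Long dstLen = target.get(relPath);
        // Copy when nothing exists under the same relative path,
        // or when the lengths differ.
        return dstLen == null || dstLen != srcLen;
    }

    public static void main(String[] args) {
        // Target as left behind by an earlier compressed copy (hypothetical data):
        // the file carries a codec suffix and a smaller length.
        Map<String, Long> target = new HashMap<>();
        target.put("logs/part-00000.bz2", 1200L);

        // A later plain -update run looks for the uncompressed name.
        System.out.println(needsCopy("logs/part-00000", 10000L, target)); // prints true
    }
}
```

Because the compressed target is stored under a different name and with a
different length, a later plain -update run finds nothing it can skip.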
In the current approach the compression seems to happen on the write side. That
means that for copies across an expensive network (such as cross-DC copies) the
data still travels uncompressed first.
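A rough sketch of the difference, assuming gzip from java.util.zip as a
stand-in for a Hadoop codec and an in-memory byte array as a stand-in for a
file. The class and method names are hypothetical, and the sample data is
synthetic; the point is only which byte count crosses the link in each
arrangement.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: compares bytes sent over the link in two arrangements. */
class WireBytesSketch {

    /** Compresses a buffer with gzip (stand-in for a Hadoop codec). */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Write-side compression: the raw bytes cross the network first. */
    static long wireBytesWriteSide(byte[] raw) {
        return raw.length;
    }

    /** Source-side compression: only the compressed bytes cross the network. */
    static long wireBytesSourceSide(byte[] raw) {
        return gzip(raw).length;
    }

    /** Synthetic, highly repetitive log-like data. */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("2017-01-13 INFO request served\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleData();
        System.out.println("write-side compression sends:  " + wireBytesWriteSide(raw));
        System.out.println("source-side compression sends: " + wireBytesSourceSide(raw));
    }
}
```

How large the gap is depends entirely on how compressible the data is; for
already-compressed inputs there may be no saving at all.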
Wouldn't it make sense to create wrapper functionality that first compresses on
the source and then uses regular distcp? The compressed temporary data could
possibly live in a /tmp directory structure. Alternatively, one can still
distcp first (to a tmp location) and then compress if that is desired. The
advantage of keeping the compression step separate from the distcp step is that
one could additionally collapse files together into fewer files if possible.
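As a sketch of the "collapse files together" idea, the snippet below packs
several small inputs into a single compressed output during a separate
compression step. It again assumes gzip from java.util.zip in place of a Hadoop
codec and byte arrays in place of HDFS files; CollapseSketch and its methods
are hypothetical names, not part of the patch or of DistCp.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: collapse many small inputs into one compressed output. */
class CollapseSketch {

    /** Writes all inputs, in order, into a single gzip stream. */
    static byte[] compressTogether(List<byte[]> smallFiles) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (byte[] file : smallFiles) {
                gz.write(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads the collapsed output back as one concatenated byte array. */
    static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Three tiny line-oriented "files" (made-up content).
        List<byte[]> parts = Arrays.asList(
                "record-1\n".getBytes(StandardCharsets.UTF_8),
                "record-2\n".getBytes(StandardCharsets.UTF_8),
                "record-3\n".getBytes(StandardCharsets.UTF_8));
        byte[] packed = compressTogether(parts);
        System.out.print(new String(decompress(packed), StandardCharsets.UTF_8));
    }
}
```

This only makes sense for data where plain concatenation is meaningful, such as
line-oriented text; the original file boundaries are lost in the collapsed
output.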
We're finding that our users already have a hard time dealing with the
intricate interactions between the various distcp flags (-atomic, -update, etc.).
> DistCp should have option to compress data on write
> ---------------------------------------------------
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
> Reporter: Suraj Nayak
> Assignee: Suraj Nayak
> Priority: Minor
> Labels: distcp
> Attachments: HADOOP-13114.05.patch, HADOOP-13114.06.patch,
> HADOOP-13114-trunk_2016-05-07-1.patch, HADOOP-13114-trunk_2016-05-08-1.patch,
> HADOOP-13114-trunk_2016-05-10-1.patch, HADOOP-13114-trunk_2016-05-12-1.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified
> compression format. This avoids the extra hop of compressing data after
> transfer. Backup strategies that copy to a different cluster also benefit from
> saving one IO operation to and from HDFS, thus saving resources, time and
> effort.
> * Create an option -compressOutput defaulting to
> {{org.apache.hadoop.io.compress.BZip2Codec}}.
> * Users will be able to change codec with {{-D
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's
> default extension to indicate that the file is compressed, so that users can
> tell which codec was used to compress the data.