[
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15815294#comment-15815294
]
Nathan Roberts commented on HADOOP-13114:
-----------------------------------------
Sorry for jumping in late. I tend to agree that this seems like it might be
outside the scope of distcp. I understand the desire to support this
capability, but the use cases get strange if we fold it into distcp itself. It
might be as simple as creating a new command, "distcompress" or something
similar, which could share exactly the same code base as distcp but enable this
new capability only in that mode. Some of the worries I have with having it in
distcp are:
- Just the name bothers me a bit. Copy commands don't normally transform data,
but this one would.
- What happens if we run the command with compression twice, e.g. distcp a->b,
then b->c? I'm assuming c would be a compressed version of b, which is itself a
compressed version of a. In order to read the data we'd have to unwind both
layers of compression. That seems strange, and it is really easy to do by
accident.
- I'm assuming CRC checks have to be disabled when doing this. Do we force the
user to disable CRC checks by providing the necessary option, or do we just do
it automatically? If it is automatic, we should WARN the user that this
happened.
- The obvious question is: "if it's valuable to compress, why wasn't it
compressed in the first place?"
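
To make the double-compression and CRC worries concrete, here is a minimal
sketch. It uses the JDK's java.util.zip gzip streams as a stand-in for a Hadoop
codec, and the class name DoubleCompressionDemo is made up for illustration; it
is not taken from the patch. It only shows that compressing an already
compressed payload adds a second layer that has to be unwound on read, and that
a checksum of the compressed bytes is computed over different bytes than the
source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; gzip from the JDK stands in for a Hadoop codec.
class DoubleCompressionDemo {

    // Compress a byte array with gzip.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Remove exactly one layer of gzip compression.
    static byte[] gunzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // CRC-32 over the raw bytes, as a simplified stand-in for a file checksum.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("some file contents, repeated. ");
        }
        byte[] a = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] b = gzip(a); // first "distcp with compression": a -> b
        byte[] c = gzip(b); // second run on the compressed copy: b -> c

        // One gunzip of c yields b (still compressed), not the original a.
        System.out.println("gunzip(c) equals a: " + Arrays.equals(gunzip(c), a));
        System.out.println("gunzip(gunzip(c)) equals a: "
                + Arrays.equals(gunzip(gunzip(c)), a));

        // The checksums cover different byte sequences, so comparing the
        // source checksum with the target checksum is not meaningful.
        System.out.println("crc32(a) = " + crc32(a) + ", crc32(b) = " + crc32(b));
    }
}
```

A reader of c has to know that two layers were applied, which is the accidental
case described above. The file checksums HDFS actually compares are more
involved than a plain CRC-32, but the same reasoning applies.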
> DistCp should have option to compress data on write
> ---------------------------------------------------
>
> Key: HADOOP-13114
> URL: https://issues.apache.org/jira/browse/HADOOP-13114
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
> Reporter: Suraj Nayak
> Assignee: Suraj Nayak
> Priority: Minor
> Labels: distcp
> Attachments: HADOOP-13114-trunk_2016-05-07-1.patch,
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch,
> HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch,
> HADOOP-13114.06.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified
> compression format. This avoids the extra hop of compressing data after the
> transfer. Backup strategies that copy to a different cluster also benefit by
> saving one I/O operation to and from HDFS, which saves resources, time, and
> effort.
> * Create an option -compressOutput, defaulting to
> {{org.apache.hadoop.io.compress.BZip2Codec}}.
> * Users will be able to change the codec with {{-D
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}.
> * If distcp compression is enabled, suffix the file names with the codec's
> default extension to indicate that the file is compressed, so users can tell
> which codec was used to compress the data.
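
The naming rule in the last bullet of the quoted description could look roughly
like the sketch below. The class CodecSuffix and the method targetName are
hypothetical names used only for illustration, not code from the attached
patches. The extension table is hard-coded so the example runs without Hadoop
on the classpath; in Hadoop itself the suffix would presumably come from
CompressionCodec.getDefaultExtension(), which reports ".bz2" for BZip2Codec and
".gz" for GzipCodec.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the proposed file naming rule.
class CodecSuffix {

    // Default codec named in the issue description.
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec";

    // Hard-coded stand-in for CompressionCodec.getDefaultExtension().
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("org.apache.hadoop.io.compress.BZip2Codec", ".bz2");
        EXTENSIONS.put("org.apache.hadoop.io.compress.GzipCodec", ".gz");
    }

    // Returns the target file name with the codec's default extension appended.
    // A null codecClass means the user did not override the default codec.
    static String targetName(String sourceName, String codecClass) {
        String codec = (codecClass == null) ? DEFAULT_CODEC : codecClass;
        String ext = EXTENSIONS.get(codec);
        if (ext == null) {
            throw new IllegalArgumentException("Unknown codec: " + codec);
        }
        return sourceName + ext;
    }

    public static void main(String[] args) {
        System.out.println(targetName("part-00000", null));
        System.out.println(targetName("part-00000",
                "org.apache.hadoop.io.compress.GzipCodec"));
    }
}
```

Because the suffix records which codec was applied, it could also serve as a
cheap signal that a source file is already compressed, which bears on the
double-compression concern raised in the comment.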