[
https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366181#comment-15366181
]
Jason Lowe commented on HADOOP-13340:
-------------------------------------
Yes, a splittable codec could be used to accomplish something similar, but
again the splits won't necessarily occur on the file boundaries -- depending
upon the codec block size a significant amount of data may need to be
decompressed and thrown away before arriving at the original file data within
the codec block. (Note that a larger codec block size could compress multiple
original small files and get a better overall compression ratio, so there's a
tradeoff.)
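
To make that cost concrete, here is a minimal Python sketch. It uses zlib as a
stand-in for a Hadoop codec and treats the whole archive as a single codec
block; both are assumptions for illustration, not how har or any Hadoop codec
actually works. The file names and contents are made up.

```python
import zlib

# Hypothetical small files packed back to back, as they would be in an archive.
files = {"a.txt": b"alpha " * 1000, "b.txt": b"bravo " * 1000}
archive = b"".join(files.values())

# Assumption: one codec block covers the whole archive.
block = zlib.compress(archive)

def read_from_block(block, offset, length):
    """Return (data, discarded) for the range [offset, offset + length) of the
    uncompressed stream. Everything before `offset` has to be decompressed
    and thrown away, because the stream can only be decoded from its start."""
    decoder = zlib.decompressobj()
    out = decoder.decompress(block, offset + length)  # max_length caps the output
    return out[offset:offset + length], offset

offset_b = len(files["a.txt"])
data, discarded = read_from_block(block, offset_b, len(files["b.txt"]))
```

Here reading b.txt discards all of a.txt first. With a larger block and more
files in front of the one wanted, the discarded portion grows accordingly.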
As I mentioned above, if the intent is to compress the original files on file
boundaries when adding them to the har then IMHO the problem is the original
files should have been compressed in the first place before trying to do the
har. Otherwise those intending to consume the original files will find
compressed data in the har rather than original file data and will need to know
that they need a codec to get back to the original file contents. If the
purpose of this request is to provide transparent compression within the har
then that will need either a splittable codec or a codec reset on file
boundaries, plus flags in the har (in a backwards-compatible manner) to indicate how
compression was performed so the resulting input stream can compensate for how
the data is laid out within the har.
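
As a rough sketch of the "reset the codec on file boundaries" option, the
Python below compresses each file independently and records a per-file flag in
an index. The index layout, the field names (offset, length, codec) and the use
of zlib are all hypothetical; this is not the actual har index format or a
Hadoop API. Entries without a codec flag are read back as raw bytes, which is
one way the backwards-compatible part could look.

```python
import zlib

def pack(files):
    """Compress each file with fresh codec state and record where it landed."""
    part = bytearray()
    index = {}
    for name, data in files.items():
        compressed = zlib.compress(data)  # codec is reset for every file
        index[name] = {
            "offset": len(part),
            "length": len(compressed),
            "codec": "zlib",  # hypothetical flag telling readers how to decode
        }
        part += compressed
    return bytes(part), index

def read(part, index, name):
    """Return the original bytes of one file without touching its neighbours."""
    entry = index[name]
    stored = part[entry["offset"]:entry["offset"] + entry["length"]]
    if entry.get("codec") == "zlib":
        return zlib.decompress(stored)
    return stored  # no flag: stored bytes are already the original bytes

# Hypothetical input files.
files = {"a.txt": b"alpha " * 1000, "b.txt": b"bravo " * 1000}
part, index = pack(files)
b_contents = read(part, index, "b.txt")
```

Each file can be decoded starting at its own offset, so nothing in front of it
is decompressed and discarded. The price is the tradeoff noted earlier: small
files compressed one at a time usually get a worse ratio than a large shared
block.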
> Compress Hadoop Archive output
> ------------------------------
>
> Key: HADOOP-13340
> URL: https://issues.apache.org/jira/browse/HADOOP-13340
> Project: Hadoop Common
> Issue Type: New Feature
> Components: tools
> Affects Versions: 2.5.0
> Reporter: Duc Le Tu
> Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other MapReduce jobs?
> I used options such as -D mapreduce.output.fileoutputformat.compress=true
> -D
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
> but it does not work. Did I do something wrong?
> If not, please add an option to compress the output of the Hadoop Archive tool.
> It is very necessary for data retention for everyone (the small files problem and
> data compression).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)