[ https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366181#comment-15366181 ]

Jason Lowe commented on HADOOP-13340:
-------------------------------------

Yes, a splittable codec could be used to accomplish something similar, but 
again the splits won't necessarily occur on the file boundaries -- depending 
upon the codec block size, a significant amount of data may need to be 
decompressed and thrown away before arriving at the original file data within 
the codec block.  (Note that a larger codec block size could compress multiple 
original small files together and get a better overall compression ratio, so 
there's a tradeoff.)
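
The tradeoff can be seen with a toy illustration using plain java.util.zip 
(not Hadoop's codec classes): many small files compressed into one shared 
stream usually beat the same files compressed one stream apiece, because each 
separate stream pays its own header overhead and starts with an empty 
dictionary. The class and method names below are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class BlockSizeTradeoff {

    // Compress all chunks into a single gzip stream and return the bytes.
    static byte[] gzipAll(byte[][] chunks) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            for (byte[] c : chunks) gz.write(c);
        }
        return out.toByteArray();
    }

    // All files share one compression stream (one big "codec block").
    public static long compressedTogether(byte[][] files) throws IOException {
        return gzipAll(files).length;
    }

    // Each file gets its own stream (compression restarted per file).
    public static long compressedIndividually(byte[][] files) throws IOException {
        long total = 0;
        for (byte[] f : files) total += gzipAll(new byte[][] { f }).length;
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[][] files = new byte[50][];
        for (int i = 0; i < files.length; i++)
            files[i] = ("small file " + i + ": some shared, repetitive payload\n")
                    .getBytes();
        System.out.println("together:     " + compressedTogether(files) + " bytes");
        System.out.println("individually: " + compressedIndividually(files) + " bytes");
    }
}
```

The shared stream wins on size precisely because its dictionary spans file 
boundaries -- which is also why a reader must decompress from the start of the 
block to reach a file in the middle.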

As I mentioned above, if the intent is to compress the original files on file 
boundaries when adding them to the har, then IMHO the problem is that the 
original files should have been compressed in the first place, before creating 
the har.  Otherwise those intending to consume the original files will find 
compressed data in the har rather than the original file data, and will need 
to know that they need a codec to get back to the original file contents.  If 
the purpose of this request is to provide transparent compression within the 
har, then it will need either a splittable codec or a codec that is reset on 
file boundaries, with flags set in the har (in a backwards-compatible manner) 
to indicate how compression was performed so the resulting input stream can 
compensate for how the data is laid out within the har.
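
A hedged sketch of the reset-on-file-boundaries idea, again with plain 
java.util.zip rather than anything from the actual har layout: each file is 
written as its own gzip member, and an index records where each member starts, 
so one file can be inflated without decompressing its predecessors. The class, 
method names, and index representation here are assumptions for illustration, 
not part of the har format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PerFileCodecReset {

    // Pack files into one stream; offsets[i] receives where file i's member starts.
    public static byte[] pack(byte[][] files, int[] offsets) throws IOException {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        for (int i = 0; i < files.length; i++) {
            offsets[i] = archive.size();
            try (GZIPOutputStream gz = new GZIPOutputStream(archive)) {
                gz.write(files[i]);
            }   // close() finishes this gzip member; the next file starts fresh
        }
        return archive.toByteArray();
    }

    // Decompress only file i, seeking straight to its recorded offset.
    public static byte[] unpackOne(byte[] archive, int[] offsets, int i)
            throws IOException {
        int start = offsets[i];
        int end = (i + 1 < offsets.length) ? offsets[i + 1] : archive.length;
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(archive, start, end - start))) {
            return gz.readAllBytes();   // inflates just this one member
        }
    }
}
```

Note that no data before the requested offset is ever inflated -- the cost of 
resetting the codec per file is exactly the per-member overhead shown in the 
tradeoff above.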

> Compress Hadoop Archive output
> ------------------------------
>
>                 Key: HADOOP-13340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13340
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 2.5.0
>            Reporter: Duc Le Tu
>              Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other map-reduce 
> jobs? I used options like -D mapreduce.output.fileoutputformat.compress=true 
> -D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
>  but it did not work. Did I go wrong somewhere?
> If not, please support an option to compress the output of the Hadoop Archive 
> tool; it's very necessary for data retention for everyone (the small files 
> problem plus compressed data).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
