[jira] [Commented] (HADOOP-13340) Compress Hadoop Archive output

Jason Lowe (JIRA) Wed, 06 Jul 2016 14:57:53 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365180#comment-15365180
 ]


Jason Lowe commented on HADOOP-13340:
-------------------------------------

A Hadoop archive (har) helps solve the small file problem by combining many 
files into a few files, along with an index file to help find the original 
files.  Therefore the data for most original files will appear at some non-zero 
offset within one of the har files.  A gzip stream cannot be decoded at an 
arbitrary offset within the stream, since the symbols in the stream are 
encoded relative to what appears earlier in the stream.  So if we used the 
gzip codec to compress those har files, that would disable the ability to 
seek arbitrarily within the har file.  The client would need to uncompress 
the entire file up to the point where the original file data starts, which 
would make the original files very slow to access.
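
To make the cost concrete, here is a rough Python sketch (not Hadoop code) of 
reading one original file out of a part file that was gzipped as a whole.  The 
names read_range, small_files and compressed_part are made up for the 
illustration, and the part file is just three byte strings joined together.

```python
import gzip
import zlib


def read_range(compressed, offset, length, chunk=4096):
    """Return (data, decoded): the `length` uncompressed bytes at `offset`,
    and how many uncompressed bytes had to be produced to reach them."""
    decoder = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
    out = bytearray()
    needed = offset + length
    pos = 0
    # There is no way to jump ahead: decoding always starts at byte 0.
    while len(out) < needed and pos < len(compressed):
        out += decoder.decompress(compressed[pos:pos + chunk])
        pos += chunk
    return bytes(out[offset:needed]), len(out)


# Hypothetical stand-in for one har part file: three small files back to back.
small_files = [b"alpha " * 500, b"bravo " * 500, b"charlie " * 500]
part = b"".join(small_files)
compressed_part = gzip.compress(part)

# The third file starts at a non-zero offset within the part file.
offset = len(small_files[0]) + len(small_files[1])
data, decoded = read_range(compressed_part, offset, len(small_files[2]))
```

Everything in front of the third file has to be decoded and thrown away 
before its first byte is available, so the cost of a read grows with the 
file's offset in the part file.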

The seek problem can be solved by compressing the original files as separate 
compression streams within the larger har file, but arguably that can be done 
today by compressing the original files before adding them to the archive.
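
A minimal sketch of the separate-streams idea, again in plain Python rather 
than Hadoop code.  The helpers build_part and read_member and the in-memory 
index are hypothetical; the index here is only a dict of (offset, compressed 
length) pairs and does not follow the real har index layout.

```python
import gzip


def build_part(files):
    """Compress each file as its own gzip stream and concatenate them.

    Returns the part bytes and an index: name -> (offset, compressed length).
    """
    part = bytearray()
    index = {}
    for name, data in files.items():
        member = gzip.compress(data)
        index[name] = (len(part), len(member))
        part += member
    return bytes(part), index


def read_member(part, index, name):
    """Seek straight to one file's stream and decompress only that stream."""
    start, size = index[name]
    return gzip.decompress(part[start:start + size])


# Hypothetical small files standing in for the originals.
files = {
    "a.txt": b"alpha " * 500,
    "b.txt": b"bravo " * 500,
    "c.txt": b"charlie " * 500,
}
part, index = build_part(files)
restored = read_member(part, index, "c.txt")
```

Because every stream starts fresh, a reader can go directly to the recorded 
offset.  The tradeoff is that each small file is compressed on its own, so 
redundancy shared between files is not exploited.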


> Compress Hadoop Archive output
> ------------------------------
>
>                 Key: HADOOP-13340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13340
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 2.5.0
>            Reporter: Duc Le Tu
>              Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other map-reduce 
> jobs?  I used some options like 
> -D mapreduce.output.fileoutputformat.compress=true 
> -D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
>  but it did not work.  Did I do something wrong?
> If not, please add an option to compress the output of the Hadoop Archive 
> tool.  It is very necessary for data retention for everyone (it addresses 
> both the small files problem and data compression).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

