[
https://issues.apache.org/jira/browse/HADOOP-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711831#comment-15711831
]
Steve Loughran commented on HADOOP-13849:
-----------------------------------------
Well, if you want to work on it, feel free.
However, know that the native codec uses the standard {{libbz2}}; there's not
much that can be done in the Hadoop code to speed that up, other than improving
how data is moved between the Java memory structures and those of libbz2. If
memory copies are taking place, that could be hurting performance. Anything
that helps there would be good.
bq. I think the "system native" should have better compress/decompress
performance than "java builtin".
That's something to explore. The latest Java 8 compilers are fast, and if the
algorithms aren't doing lots of object creation, then bit operations in Java
should be on a par with C-language operations against general registers. Where
you would expect differences is if the native code uses special CPU registers
and instructions (for example, Intel SSE2) for a significant performance gain.
I don't know whether bzip2 does that.
The fun part in benchmarking is isolating things. For codec performance, maybe
pre-generate some test data and cache it in RAM, in standard formats (Avro,
ORC), then compress it with the different codecs to RAM rather than to disk,
so that the compression code is isolated from disk I/O, etc.
If the isolated native code is faster than the Java one, the implication is
that the bottleneck is elsewhere in the workflow, not in the codec. Again:
that's interesting information.
bq. My hardware CPU/Memory/Network bandwidth/Disk bandwidth are not bottleneck
One of them is, always; and it can be subtle things: CPU cache latencies,
excess synchronization in the code, even branch misprediction in the CPU can
hurt efficiency. FWIW, flamegraphs are currently the tool of choice for
visualising performance during microbenchmarks.
> Bzip2 java-builtin and system-native have almost the same compress speed
> ------------------------------------------------------------------------
>
> Key: HADOOP-13849
> URL: https://issues.apache.org/jira/browse/HADOOP-13849
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Affects Versions: 2.6.0
> Environment: os version: redhat6
> hadoop version: 2.6.0
> native bzip2 version: bzip2-devel-1.0.5-7.el6_0.x86_64
> Reporter: Tao Li
>
> I tested bzip2 java-builtin and system-native compression, and I found the
> compress speed is almost the same. (I think the system-native should have
> better compress speed than java-builtin)
> My test case:
> 1. input file: 2.7GB text file without compression
> 2. after bzip2 java-builtin compress: 457MB, 12min 4sec
> 3. after bzip2 system-native compress: 457MB, 12min 19sec
> My MapReduce Config:
> conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
> conf.set("mapreduce.output.fileoutputformat.compress", "true");
> conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
> conf.set("mapreduce.output.fileoutputformat.compress.codec",
> "org.apache.hadoop.io.compress.BZip2Codec");
> conf.set("io.compression.codec.bzip2.library", "java-builtin"); // for
> java-builtin
> conf.set("io.compression.codec.bzip2.library", "system-native"); // for
> system-native
> And I am sure I have enabled the bzip2 native library; the output of the
> command "hadoop checknative -a" is as follows:
> Native library checking:
> hadoop: true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib: true /lib64/libz.so.1
> snappy: true /usr/lib/hadoop/lib/native/libsnappy.so.1
> lz4: true revision:99
> bzip2: true /lib64/libbz2.so.1
> openssl: true /usr/lib64/libcrypto.so
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)