[
https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15677114#comment-15677114
]
Jason Lowe commented on HADOOP-13578:
-------------------------------------
Looking forward to the updated patch with the memory fixes. Some comments on
the latest patch. I didn't get time to look at the native code or unit test
changes in detail, but here are the comments I had thus far:
io.compression.codec.zst.buffersize should be
io.compression.codec.zstd.buffersize to have a consistent prefix with
io.compression.codec.zstd.level
In ZStandardCodec#createOutputStream and ZStandardCodec#createInputStream I'm
curious why we're not honoring the user's specified buffer size and just using
the default buffer size. Many codecs use io.file.buffer.size here, but others
use a codec-specific config property.
In ZStandardCompressor and ZStandardDecompressor, the buffers should simply be
declared as ByteBuffer instead of Buffer. That way the ugly casting when doing
bulk get/put with them is unnecessary.
In ZStandardDecompressor it's a little confusing that some buffer length
variables like userBufLen reflect the number of bytes remaining to be consumed
in the buffer, while compressedDirectBufLen reflects the number of bytes
originally placed in the buffer, so the real length of data to be consumed is
(compressedDirectBufLen - compressedDirectBufOffset). Maybe a variable name
change would help notify future maintainers of the semantic difference with
similar variables for other buffers, or we should change the semantics so that
variable is consistent with the others?
I'm a little wary of the public getBytesWritten and getBytesRead methods in
ZStandardDecompressor. getBytesWritten doesn't appear to be all that useful
since the caller can trivially compute it based on the amount of data they read
from the input stream the decompressor is writing. Also both of these methods
reset values to zero at frame boundaries in the input stream, which may confuse
callers who are comparing these values to the entire compressed input data
size. I'd recommend removing getBytesWritten which is unused and making
getBytesRead package-private just for unit testing.
It would be nice to change the logging from org.apache.commons.logging to
org.slf4j. Sorry I didn't catch this earlier. We're supposed to be slowly
migrating to SLF4J, and new code should be using it. Fortunately it's an easy
change, and we can remove the conditional checks for debug logging.
Do we need to have test_file.txt? I'm wondering if simply generating random
data (maybe with a constant seed so the test is deterministic) into a byte
buffer would be sufficient input for testing. Typically we want to avoid test
files in the source tree unless there's something very specific about the
file's contents necessary to exercise a unit test case.
Typos and coding style minor nits:
- ZStandardCodec
-- conf should be private in ZStandardCodec since it isn't accessed outside of
the class.
-- "objec." s/b "object." and "ZStandard Decompressor" s/b
"ZStandardDecompressor"
- ZStandardCompressor
-- Comment refers to ZLIB format
-- Missing whitespace in "(b== null)" in setInput
-- Missing braces around "keepUncompressedBuf && uncompressedDirectBufLen > 0"
conditional block in needsInput
-- 'else' should be on same line as closing brace in compress
-- 'atmost' s/b 'at most'
-- Missing braces around conditional block in checkStream
- ZStandardDecompressor
-- Annotation should appear on line before getDirectBufferSize method.
> Add Codec for ZStandard Compression
> -----------------------------------
>
> Key: HADOOP-13578
> URL: https://issues.apache.org/jira/browse/HADOOP-13578
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: churro morales
> Assignee: churro morales
> Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch,
> HADOOP-13578.v2.patch, HADOOP-13578.v3.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production for 6
> months by facebook now. v1.0 was recently released. Create a codec for this
> library.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]