[
https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678157#comment-15678157
]
churro morales commented on HADOOP-13578:
-----------------------------------------
Hi Jason, thanks for the review. I figured out the memory issue and will submit
a patch Monday; it was quite a bit of fun figuring out why the JVM was
segfaulting in particular scenarios :).
io.compression.codec.zst.buffersize should be
io.compression.codec.zstd.buffersize to have a consistent prefix with
io.compression.codec.zstd.level.
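Side by side, the consistent keys would look like this (sketch; the constant
names are illustrative, the key strings are the ones proposed above):
{code:java}
// Illustrative constant names; the key strings are the ones discussed above.
public static final String ZSTD_LEVEL_KEY =
    "io.compression.codec.zstd.level";
public static final String ZSTD_BUFFER_SIZE_KEY =
    "io.compression.codec.zstd.buffersize";
{code}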
In ZStandardCodec#createOutputStream and ZStandardCodec#createInputStream, I'm
curious why we're not honoring the user's specified buffer size and are just
using the default. Many codecs use io.file.buffer.size here, but others use a
codec-specific config property.
bq. I totally agree. I'm going to just use io.file.buffer.size as the buffer
size, and we will honor it.
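bq. Roughly what I have in mind (a sketch, not the final patch; the class name
is hypothetical and it assumes the codec keeps its Configuration in a conf
field via Configurable):
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeys;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.CompressorStream;

public class ZStandardCodecSketch {
  private Configuration conf;  // set via Configurable#setConf

  // Sketch: honor io.file.buffer.size instead of a hard-coded default
  // when wrapping the compressor.
  public CompressionOutputStream createOutputStream(OutputStream out,
      Compressor compressor) throws IOException {
    int bufferSize = conf.getInt(
        CommonConfigurationKeys.IO_FILE_BUFFER_SIZE_KEY,
        CommonConfigurationKeys.IO_FILE_BUFFER_SIZE_DEFAULT);
    return new CompressorStream(out, compressor, bufferSize);
  }
}
{code}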
In ZStandardCompressor and ZStandardDecompressor, the buffers should simply be
declared as ByteBuffer instead of Buffer. That way the ugly casting when doing
bulk get/put with them is unnecessary.
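For reference, this is the kind of cast that goes away (illustrative field and
variable names, following the pattern in the other codecs):
{code:java}
// Illustrative: with the field declared as java.nio.Buffer, every bulk op needs a cast
((ByteBuffer) uncompressedDirectBuf).put(b, off, len);
// Declared as java.nio.ByteBuffer instead, the cast disappears
uncompressedDirectBuf.put(b, off, len);
{code}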
bq. done
In ZStandardDecompressor it's a little confusing that some buffer length
variables, like userBufLen, reflect the number of bytes remaining to be
consumed in the buffer, while compressedDirectBufLen reflects the number of
bytes originally placed in the buffer, so the real length of data to be
consumed is (compressedDirectBufLen - compressedDirectBufOffset). Maybe a
variable name change would help alert future maintainers to the semantic
difference from similar variables for other buffers, or we should change the
semantics so that variable is consistent with the others?
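To make the difference concrete (variable names as in the current patch):
{code:java}
// userBufLen counts bytes *remaining* to consume, but compressedDirectBufLen
// counts bytes *originally placed*, so remaining input must be derived:
int remaining = compressedDirectBufLen - compressedDirectBufOffset;
{code}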
bq. Yeah, that is confusing; I'll change the names of those variables. I was
just following how other Decompressors did their naming.
I'm a little wary of the public getBytesWritten and getBytesRead methods in
ZStandardDecompressor. getBytesWritten doesn't appear to be all that useful,
since the caller can trivially compute it from the amount of data they read
off the stream the decompressor is feeding. Also, both of these methods reset
their values to zero at frame boundaries in the input stream, which may
confuse callers who compare these values against the size of the entire
compressed input. I'd recommend removing getBytesWritten, which is unused, and
making getBytesRead package-private just for unit testing.
bq. I originally had those for testing purposes and totally forgot to remove
them. I'll get rid of those methods entirely.
It would be nice to change the logging from org.apache.commons.logging to
org.slf4j. Sorry I didn't catch this earlier. We're supposed to be slowly
migrating to SLF4J, and new code should be using it. Fortunately it's an easy
change, and we can remove the conditional checks for debug logging.
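For example, the change looks roughly like this (sketch; only the logging bits
are shown, and the debug message and its argument are illustrative):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ZStandardCompressor {  // class name from this patch; body is a sketch
  private static final Logger LOG =
      LoggerFactory.getLogger(ZStandardCompressor.class);

  void example(int n) {
    // Parameterized messages defer string building, so the old
    // if (LOG.isDebugEnabled()) guard is unnecessary for cheap arguments:
    LOG.debug("compressed {} bytes", n);
  }
}
{code}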
bq. sure thing
Do we need to have test_file.txt? I'm wondering if simply generating random
data (maybe with a constant seed so the test is deterministic) into a byte
buffer would be sufficient input for testing. Typically we want to avoid test
files in the source tree unless there's something very specific about the
file's contents necessary to exercise a unit test case.
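For example (sketch; the buffer size and seed are arbitrary):
{code:java}
import java.util.Random;

// Sketch: deterministic pseudo-random input instead of a checked-in text file.
byte[] input = new byte[256 * 1024];       // size is arbitrary
new Random(0xdeadbeefL).nextBytes(input);  // constant seed keeps the test reproducible
{code}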
bq. Sorry, I forgot to commit the other file that went along with it:
test_file.txt.zst. That file was compressed using the command line utility,
and I wanted to make sure we could go back and forth between the command line
and Hadoop. I think that is useful to have as a test case: if anyone changes
that behavior down the line, the tests will break and it should be pretty
clear why.
I will fix all coding style issues and other minor nits, no problem.
Again, thank you so much for taking the time to review this patch. I will get
the (hopefully final) patch up Monday.
> Add Codec for ZStandard Compression
> -----------------------------------
>
> Key: HADOOP-13578
> URL: https://issues.apache.org/jira/browse/HADOOP-13578
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: churro morales
> Assignee: churro morales
> Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch,
> HADOOP-13578.v2.patch, HADOOP-13578.v3.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been in production at
> Facebook for 6 months now. v1.0 was recently released. Create a codec for
> this library.