[jira] [Updated] (HADOOP-13578) Add Codec for ZStandard Compression

churro morales (JIRA) Tue, 06 Dec 2016 11:57:33 -0800

     [ 
https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


churro morales updated HADOOP-13578:
------------------------------------
    Attachment: HADOOP-13578.v5.patch

Hi [~jlowe] first of all thank you for taking the time to review this code.  I 
have attached another patch and addressed the comments you had below.

I'm torn on the buffer size, whether we should use the generic 
io.file.buffer.size or have a zst-specific config. If the codec can 
significantly speed up with a larger default buffer size than the file buffer 
size then maybe we should consider having a separate, zst-specific buffer size 
config. I'd recommend a zst-specific config key with a default value of 0 that 
means use whatever buffer size the zst library wants to use, but the user can 
override it to a custom size. That way users can change the zst codec buffer 
size (again, useful for keeping memory usage reasonable for very wide sort 
factor merges) but not change the buffer sizes of anything else. Similarly, 
someone changing the default buffer size to something relatively small (e.g.: 
4K) for some unrelated use case could unknowingly hamper the performance of the 
zst codec.

bq. Thats fine, created a new configuration option for buffer size (which is 
codec specific, with a default value of 0), if the value is 0 then we use the 
recommended library buffer size unless the user specifically overrides it.

Regarding the test input file, is it copyrighted or do we otherwise have the 
redistribution rights? Also the .zst binary test file is missing from the patch.

bq. I updated the file to one that I know is not copyrighted and added a 
license header.  Also added the binary.

ZStandardCodec.java
Nit: should import the specific things needed rather than wildcard imports.

bq. Fixed

ZStandardCompressor.java
reinit calls reset then init, but reset already is calling init.
bq. removed the call to init from reinit

LOG should be private rather than public.

bq. done

The LOG.isDebugEnabled check is unnecessary.

bq. done

Nit: The VisibleForTesting annotation should be on a separate line like the 
other annotations.

bq. done

ZStandardCompressor.c:
Shouldn't we throw if GetDirectBufferAddress returns null?

bq. added a check to throw when we can't resolve a memory address for the 
buffer and we get back null.

>From the zstd documentation, ZSTD_compressStream may not always fully consume 
>the input. Therefore if the finish flag is set in deflateBytesDirect I think 
>we need to avoid calling ZTD_endStream if there is still bytes left in the 
>input or we risk dropping some of the input. We can just return without 
>setting the finished flag and expect to get called again from the outer 
>compression loop logic.

Similarly the documentation for ZSTD_endStream says that it may not be able to 
flush the full contents of the data in one call, and if it indicates there is 
more data left to flush then we should call it again. So if we tried to end the 
stream and there's more data left then we shouldn't throw an error. Instead we 
simply leave the finished flag unset and return so the outer logic loop will 
empty the output buffer and call compress() again. We will then again call 
ZSTD_endStream to continue attempting to finish the compression. Bonus points 
for adding a test case that uses a 1-byte output buffer to demonstrate the 
extreme case is working properly.

bq. Sure thing, I think not throwing if there is data remaining to flush when 
endStream is called and instead setting finished to false will resolve both 
issues.  I added a test called testCompressingWithOneByteOutputBuffer()

Is it necessary to call ZSTD_flushStream after every ZSTD_compressStream call? 
Seems to me we can skip it entirely, but I might be missing something.

bq. To keep consistent with the API unfortunately I have to call flushStream.  
The input and output buffers of the zstandard library "pos" fields are only 
updated when the stream is flushed so unfortunately that is necessary with the 
Compressor / Decompressor framework.

ZStandardDecompressor.java:
LOG should be private rather than public.

bq. done

Should there be a checkStream call at the beginning of decompress()? This would 
mirror the same check in the compressor for the corresponding compress() method.

bq. done

In inflateDirect, the setting of preSliced = dst in the conditional block is 
redundant since it was just done before the block. Actually I think it would be 
faster and create less garbage to avoid creating a temporary ByteBuffer and 
have the JNI code to lookup the output buffer's position and limit and adjust 
accordingly. That way we aren't creating an object each time.

bq. yes that is not necessary, passing parameters in will fix that. 

The swapping of variables in inflateDirect is cumbersome. It would be much 
cleaner if inflateBytesDirect took arguments so we can pass what the JNI needs 
directly to it rather than shuffling the parameters into the object's fields 
before we call it. That also cleans up the JNI code a bit since it doesn't need 
to do many of the field lookups. For example:

  private native int inflateBytesDirect(ByteBuffer src, int srcOffset, int 
srcLen, ByteBuffer dst, int dstOffset, int dstLen);

bq. inflateBytesDirect is now parameterized.

We could simplify the variables even more if the JNI code called the ByteBuffer 
methods to update the positions and limits directly rather than poking the 
values into the Java object and having Java do it when the JNI returns. I'll 
leave the details up to you, just pointing out that we can clean this up and 
avoid a lot of swapping. We could do a similar thing on the compressor side.

bq. We could but we have to do things like getting the object class, then the 
method ids we just push the work and complexity into the c code.  I fixed the 
variable swapping in the direct decompressor and also parametrized the inflate 
and deflate methods to make things cleaner.

I'm a little confused about the endOfInput flag for the direct decompressor. I 
assume it's for handling the case where there are multiple frames within the 
input buffer. The JNI code will set finished = true once it hits the end of the 
frame, and it appears it will remain that way even if more input is sent (i.e.: 
finished will go from false to true after the first frame in the input buffer 
and remain true throughout all subsequent frames). It makes me wonder if we 
should be setting finished to false if we didn't see the end of a frame this 
iteration?

bq. agreed you are correct we should be setting finished to false if we don't 
see end of frame for an iteration.  This would allow for the multiple frames to 
be handled correctly.

ZStandardDecompressor.c
The handling of result codes from ZSTD_freeCStream is different between the 
compressor and decompressor. Not sure a non-zero result means unflushed data, 
and I'd recommend updating the handling here to match what is done in the 
compressor JNI code.

bq. good catch, will fix it to check for error and throw.

Shouldn't we throw if GetDirectBufferAddress returns null?
bq. I will throw instead of returning 0 if we can't get a memory address, I 
agree with your sentiments.

The x ^ ((x ^ y) & -(x < y)) expression is clever, but it's probably too smart 
for many compilers to realize what this is and generate the optimal code for 
max() on that platform. I tested this on gcc for x64 and it generates smaller, 
better code if written as a naive implementation than with this expression. 
Therefore I recommend we just do the simple thing here (i.e.: ternary operator 
or plain ol' if statement) which will likely be faster in practice (because 
compilers can easily detect and optimize for it) and easier to understand.

Sounds good you make a good point.  I'll change it.

TestZStandardCompressorDecompressor.java:
org.junit.Ignore import is unnecessary

Nit: Should not do wildcard imports but only import what is necessary

bq. done. 

uncompressedFile and compressedFile lookups only need to be done once, so we 
can do this in the default constructor rather than before every test in the 
Before method (and those fields could be marked final).

bq. will put then in a static before class. 

I forgot to mention it would also be good to address at least some of the 
checkstyle issues, such as the indentation levels and line-length.

bq. fixed everything java related.  The c code wraps quite a bit if we keep by 
the standards and then becomes unreadable.  Many of the method names due to jni 
are longer than 80 chars themselves which cannot be wrapped.  I looked at other 
c code in the compress package and none of them follow the line length 
guidelines either.  I assume that its because at a line length of 80 everything 
would become wrapped and many things simply just can't be wrapped.




> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch, 
> HADOOP-13578.v2.patch, HADOOP-13578.v3.patch, HADOOP-13578.v4.patch, 
> HADOOP-13578.v5.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production for 6 
> months by facebook now.  v1.0 was recently released.  Create a codec for this 
> library.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HADOOP-13578) Add Codec for ZStandard Compression

Reply via email to