Polepoint opened a new issue, #43745:
URL: https://github.com/apache/arrow/issues/43745

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   According to
   
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc
   and
   
https://github.com/apache/hadoop/blob/release-3.4.1-RC1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/Lz4Codec.java#L92
   lz4-hadoop should be implemented as a block stream, which means that the 
input may be split into blocks, with each block compressed separately with LZ4. 
   The output is laid out as follows:
   `- 4-byte big-endian uncompressed_size of all blocks`
   `- 4-byte big-endian compressed_size of the following block`
    `< lz4 compressed block >`
   `- 4-byte big-endian compressed_size of the following block`
    `< lz4 compressed block >`
   `- 4-byte big-endian compressed_size of the following block`
    `< lz4 compressed block >`
   `... repeated until the uncompressed_size from the outer header is consumed ...`
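
The layout above can be walked with a short loop. The following is only a minimal sketch of the framing, not code from Hadoop or Arrow: `decode_hadoop_frame` is a hypothetical name, and the raw LZ4 block decompressor is passed in as a callable (`decompress_block`, also hypothetical) so that the framing logic stays independent of any particular LZ4 binding.

```python
import struct
from typing import Callable, Tuple

# Hypothetical signature: (compressed_bytes, max_output_size) -> decompressed bytes.
BlockDecompressor = Callable[[bytes, int], bytes]


def decode_hadoop_frame(
    buf: bytes, decompress_block: BlockDecompressor, offset: int = 0
) -> Tuple[bytes, int]:
    """Decode one outer frame; return (decompressed_bytes, offset_after_frame)."""
    if len(buf) - offset < 4:
        raise ValueError("truncated frame header")
    # Outer header: total uncompressed size of all blocks in this frame.
    (total,) = struct.unpack_from(">I", buf, offset)
    offset += 4

    out = bytearray()
    # Keep reading size-prefixed blocks until the outer size is consumed.
    while len(out) < total:
        if len(buf) - offset < 4:
            raise ValueError("truncated block header")
        (compressed_size,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        if len(buf) - offset < compressed_size:
            raise ValueError("truncated block body")
        block = buf[offset : offset + compressed_size]
        out += decompress_block(block, total - len(out))
        offset += compressed_size

    if len(out) != total:
        raise ValueError("blocks decompressed to more than uncompressed_size")
    return bytes(out), offset
```

The returned offset lets a caller continue with the next outer frame, if the buffer contains more than one.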
   
   The implementation of lz4-hadoop in Arrow seems to accept only one block: it 
returns `kNotHadoop` immediately if the `maybe_decompressed_size` of the first 
block is not equal to `expected_decompressed_size` (which is actually the total 
decompressed size of all blocks).
   
https://github.com/apache/arrow/blob/release-17.0.0-rc2/cpp/src/arrow/util/compression_lz4.cc#L509
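
To illustrate the mismatch, here is a simplified model of the check described above. It is not Arrow's code: `first_block_matches_total` and `make_frame` are hypothetical helpers, and the block codec is passed in as a callable so the example does not depend on a specific LZ4 binding.

```python
import struct
from typing import Callable, Iterable

# Hypothetical codec signatures, supplied by the caller.
BlockCompressor = Callable[[bytes], bytes]
BlockDecompressor = Callable[[bytes, int], bytes]


def make_frame(blocks: Iterable[bytes], compress_block: BlockCompressor) -> bytes:
    """Build one outer frame: total size, then size-prefixed compressed blocks."""
    blocks = list(blocks)
    out = bytearray(struct.pack(">I", sum(len(b) for b in blocks)))
    for raw in blocks:
        compressed = compress_block(raw)
        out += struct.pack(">I", len(compressed)) + compressed
    return bytes(out)


def first_block_matches_total(buf: bytes, decompress_block: BlockDecompressor) -> bool:
    """Model of a single-block check: compare the first block's decompressed
    size with the outer uncompressed_size. False stands in for kNotHadoop."""
    if len(buf) < 8:
        return False
    total, compressed_size = struct.unpack_from(">II", buf, 0)
    if len(buf) - 8 < compressed_size:
        return False
    block = decompress_block(buf[8 : 8 + compressed_size], total)
    return len(block) == total
```

With any block codec, a frame built from one block passes this check, while a frame built from two or more non-empty blocks fails it, because the first block alone never decompresses to the outer `uncompressed_size`.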
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
