Polepoint opened a new issue, #43745: URL: https://github.com/apache/arrow/issues/43745
### Describe the bug, including details regarding any error messages, version, and platform.

According to https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc and https://github.com/apache/hadoop/blob/release-3.4.1-RC1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/Lz4Codec.java#L92, lz4-hadoop should be implemented as a block stream: the input may be split into blocks, and each block is compressed with lz4 separately. The output looks like this:

`- 4-byte big endian uncompressed_size of all blocks`
`- 4-byte big endian compressed_size of the following block`
`< lz4 compressed block >`
`- 4-byte big endian compressed_size of the following block`
`< lz4 compressed block >`
`- 4-byte big endian compressed_size of the following block`
`< lz4 compressed block >`
`... repeated until uncompressed_size from outer block is consumed ...`

The lz4-hadoop implementation in Arrow seems to accept only one block: it returns `kNotHadoop` immediately if the `maybe_decompressed_size` of the first block is not equal to `expected_decompressed_size` (which is actually the total decompressed size of all blocks).

https://github.com/apache/arrow/blob/release-17.0.0-rc2/cpp/src/arrow/util/compression_lz4.cc#L509

### Component(s)

C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
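For illustration, the block-stream layout described in the issue could be walked roughly as follows. This is a minimal Python sketch of the framing only, not Arrow's or Hadoop's actual code: the function names `encode_hadoop_frame` and `decode_hadoop_frame` are hypothetical, and the per-block codec is passed in as a callback (a real caller would presumably wrap a raw LZ4 block compressor and decompressor there). It also assumes a single outer frame.

```python
import struct
from typing import Callable, Tuple


def encode_hadoop_frame(data: bytes, block_len: int,
                        compress_block: Callable[[bytes], bytes]) -> bytes:
    """Build one outer frame: total uncompressed size, then sized blocks.

    Hypothetical helper for illustration; block_len is the number of
    uncompressed bytes placed in each inner block.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    # 4-byte big endian uncompressed_size of all blocks
    out = bytearray(struct.pack(">I", len(data)))
    for start in range(0, len(data), block_len):
        block = compress_block(data[start:start + block_len])
        # 4-byte big endian compressed_size of the following block
        out += struct.pack(">I", len(block))
        out += block
    return bytes(out)


def decode_hadoop_frame(buf: bytes,
                        decompress_block: Callable[[bytes], bytes]
                        ) -> Tuple[bytes, int]:
    """Decode one outer frame; return (decoded bytes, bytes consumed)."""
    if len(buf) < 4:
        raise ValueError("truncated frame: missing uncompressed_size")
    (total_size,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    out = bytearray()
    # Keep reading inner blocks until the outer uncompressed_size is
    # consumed, instead of assuming the first block covers everything.
    while len(out) < total_size:
        if len(buf) - pos < 4:
            raise ValueError("truncated frame: missing compressed_size")
        (block_size,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        if len(buf) - pos < block_size:
            raise ValueError("truncated frame: incomplete block")
        out += decompress_block(buf[pos:pos + block_size])
        pos += block_size
    if len(out) != total_size:
        raise ValueError("blocks decoded to more bytes than declared")
    return bytes(out), pos


def identity(block: bytes) -> bytes:
    """Stand-in codec (assumption): stores blocks uncompressed."""
    return block
```

With the stand-in codec, `encode_hadoop_frame(b"hello world!", 5, identity)` produces three inner blocks of 5, 5 and 2 bytes under an outer size of 12. A reader that compares the first block's decoded size (5) with the outer size (12) would reject this frame, which is the behaviour the issue reports; looping until the outer size is consumed decodes it.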
