clee704 opened a new issue, #49641:
URL: https://github.com/apache/arrow/issues/49641
### Describe the bug
`Lz4HadoopCodec::Compress` writes the entire input as a single Hadoop-framed
LZ4 block regardless of size. Hadoop's `Lz4Decompressor` allocates a fixed 256
KiB output buffer per block (`IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT = 256
* 1024`), so any block whose decompressed size exceeds 256 KiB causes
`LZ4Exception` on JVM readers (parquet-mr + Hadoop `BlockDecompressorStream`).
PARQUET-1878 added `Lz4HadoopCodec`, but its writer emits one block per page.
ARROW-11301 fixed the *reader* for multi-block Hadoop data, but the *writer*
was never updated to split large inputs the same way Hadoop's
`BlockCompressorStream` does.
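
For illustration, here is a minimal sketch of what a splitting writer could look like. It assumes the frame layout that Arrow's Hadoop-compatible LZ4 codec uses (a 4-byte big-endian decompressed size, a 4-byte big-endian compressed size, then the compressed bytes) and the 256 KiB limit quoted above. All names (`CompressHadoopFramed`, `BlockCompressor`, `kHadoopLz4BlockSize`) are hypothetical, and the raw LZ4 call is abstracted behind a callback so the sketch does not depend on the LZ4 library. This is not Arrow's actual implementation or a proposed patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical constant mirroring Hadoop's default LZ4 buffer size (256 KiB).
constexpr std::size_t kHadoopLz4BlockSize = 256 * 1024;

// Hypothetical stand-in for a raw LZ4 block compressor: takes one chunk and
// returns its compressed bytes.
using BlockCompressor =
    std::function<std::vector<uint8_t>(const uint8_t* data, std::size_t len)>;

// Append a 32-bit value in big-endian byte order.
inline void AppendBigEndian32(std::vector<uint8_t>& out, uint32_t v) {
  out.push_back(static_cast<uint8_t>(v >> 24));
  out.push_back(static_cast<uint8_t>(v >> 16));
  out.push_back(static_cast<uint8_t>(v >> 8));
  out.push_back(static_cast<uint8_t>(v));
}

// Split `input` into chunks of at most `max_block_size` bytes and emit each
// chunk as its own frame:
//   [decompressed size][compressed size][compressed bytes]
// An empty input produces an empty output in this sketch.
inline std::vector<uint8_t> CompressHadoopFramed(
    const std::vector<uint8_t>& input, const BlockCompressor& compress,
    std::size_t max_block_size = kHadoopLz4BlockSize) {
  assert(max_block_size > 0);
  std::vector<uint8_t> out;
  std::size_t offset = 0;
  while (offset < input.size()) {
    const std::size_t chunk = std::min(max_block_size, input.size() - offset);
    const std::vector<uint8_t> compressed =
        compress(input.data() + offset, chunk);
    AppendBigEndian32(out, static_cast<uint32_t>(chunk));
    AppendBigEndian32(out, static_cast<uint32_t>(compressed.size()));
    out.insert(out.end(), compressed.begin(), compressed.end());
    offset += chunk;
  }
  return out;
}
```

With this scheme a 320 KiB dictionary page would become two frames (256 KiB and 64 KiB), so no single frame would ask the JVM reader to decompress more than its buffer holds.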
### Steps to reproduce
Write a Parquet file with `LZ4_HADOOP` compression containing a dictionary
page >256 KiB (e.g. 40K unique INT64 values = 320 KiB), then read it with a
JVM-based Parquet reader (parquet-mr + Hadoop).
### Expected behavior
The file should be readable by JVM-based Parquet readers.
### Actual behavior
```
net.jpountz.lz4.LZ4Exception: Error decoding offset 131193 of input buffer
    at net.jpountz.lz4.LZ4JNISafeDecompressor.decompress(LZ4JNISafeDecompressor.java:71)
    at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompressDirectBuf(Lz4Decompressor.java:278)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
...
```
### Severity
Read failure, not data corruption. The bytes on disk are valid LZ4, and Arrow's
own C++ reader handles them fine. The JVM reader throws a hard exception; it
does not return wrong data.
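
To check whether an existing page buffer would trip this failure, one option is to walk its frame headers and flag any frame that declares more than 256 KiB of decompressed data. The sketch below assumes each frame holds exactly one compressed chunk behind an 8-byte header (big-endian decompressed size, then big-endian compressed size), which is what the writer described in this report produces. It does not reproduce Hadoop's reader logic, and the names (`CheckHadoopFrames`, `FrameCheck`, `kHadoopLz4BufferSize`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constant mirroring Hadoop's default LZ4 buffer size (256 KiB).
constexpr uint32_t kHadoopLz4BufferSize = 256 * 1024;

enum class FrameCheck { kOk, kOversizedBlock, kMalformed };

// Read a 32-bit big-endian value from four bytes.
inline uint32_t ReadBigEndian32(const uint8_t* p) {
  return (static_cast<uint32_t>(p[0]) << 24) |
         (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Walk the frame headers in `buf` without decompressing anything.
// Returns kOversizedBlock at the first frame whose declared decompressed size
// exceeds `limit`, kMalformed if a header or payload is truncated, else kOk.
inline FrameCheck CheckHadoopFrames(const std::vector<uint8_t>& buf,
                                    uint32_t limit = kHadoopLz4BufferSize) {
  std::size_t pos = 0;
  while (pos < buf.size()) {
    if (buf.size() - pos < 8) return FrameCheck::kMalformed;
    const uint32_t decompressed = ReadBigEndian32(buf.data() + pos);
    const uint32_t compressed = ReadBigEndian32(buf.data() + pos + 4);
    pos += 8;
    if (compressed > buf.size() - pos) return FrameCheck::kMalformed;
    if (decompressed > limit) return FrameCheck::kOversizedBlock;
    pos += compressed;
  }
  return FrameCheck::kOk;
}
```

A buffer whose first header declares 320 KiB of decompressed data would be reported as `kOversizedBlock`. Data written by Hadoop itself may use a different multi-chunk layout, so this check is only meaningful for one-chunk-per-frame output.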
### Component(s)
C++
### Related issues
ARROW-9177, PARQUET-1878, ARROW-11301