clee704 opened a new pull request, #49642: URL: https://github.com/apache/arrow/pull/49642
### Rationale

`Lz4HadoopCodec::Compress` writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's `Lz4Decompressor` allocates a fixed 256 KiB output buffer per block (`IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT`), so any block whose decompressed size exceeds 256 KiB causes an `LZ4Exception` on JVM readers (parquet-mr + Hadoop `BlockDecompressorStream`).

This was found when writing Parquet dictionary pages >256 KiB with LZ4 compression (e.g. 40K unique INT64 values = 320 KiB). The file was written successfully, but the JVM reader could not decompress the dictionary page.

This is a **read failure, not data corruption**: the compressed bytes on disk are valid LZ4, and Arrow's own C++ reader decompresses them fine. The JVM reader throws a hard exception; it does not silently return wrong data.

PARQUET-1878 added the Hadoop-compatible codec but only writes one block per page. ARROW-11301 fixed the reader for multi-block Hadoop data, but the writer was never updated.

### What changes are included in this PR?

Split the input into blocks of ≤ 256 KiB in `Lz4HadoopCodec::Compress`, and update `MaxCompressedLen` to account for the per-block prefix overhead. Arrow's reader (`TryDecompressHadoop`) already handles multiple blocks. There is no behavioral change for data ≤ 256 KiB: it still produces a single block, with output identical to before.

### Are these changes tested?

Yes:
- `MultiBlockRoundtrip`: compress→decompress round-trip for sizes from 0 to 1 MiB.
- `BlockSizeLimit`: parses the compressed output and asserts that every block's `decompressed_size` is ≤ 256 KiB. **Fails without the fix, passes with it.**

### Are there any user-facing changes?

Parquet files written with `LZ4_HADOOP` compression containing pages >256 KiB will now be readable by JVM-based Parquet readers (parquet-mr + Hadoop); previously these files caused an `LZ4Exception` on the JVM reader. There is no change for files with pages ≤ 256 KiB.

Closes #49641.
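For context, the Hadoop framing described above prepends two big-endian 32-bit sizes (decompressed size, then compressed size) to each block. The sketch below uses hypothetical helper names and copies the payload verbatim in place of real LZ4 compression (so it stays self-contained); it shows how a writer can split input into ≤ 256 KiB frames and how a check like the `BlockSizeLimit` test can walk the frames and inspect the declared sizes:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hadoop framing: each block is
//   [4-byte BE decompressed size][4-byte BE compressed size][LZ4 data]
constexpr size_t kHadoopBlockSize = 256 * 1024;  // Lz4Decompressor buffer

void PutBE32(std::vector<uint8_t>* out, uint32_t v) {
  out->push_back(static_cast<uint8_t>(v >> 24));
  out->push_back(static_cast<uint8_t>(v >> 16));
  out->push_back(static_cast<uint8_t>(v >> 8));
  out->push_back(static_cast<uint8_t>(v));
}

uint32_t GetBE32(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Frame `input` into Hadoop blocks of at most kHadoopBlockSize each.
// A real codec would LZ4-compress each chunk; here the payload is
// copied verbatim so the framing logic is the only moving part.
std::vector<uint8_t> FrameHadoop(const std::vector<uint8_t>& input) {
  std::vector<uint8_t> out;
  size_t pos = 0;
  while (pos < input.size()) {
    size_t chunk = std::min(kHadoopBlockSize, input.size() - pos);
    PutBE32(&out, static_cast<uint32_t>(chunk));  // decompressed size
    PutBE32(&out, static_cast<uint32_t>(chunk));  // "compressed" size
    out.insert(out.end(), input.begin() + pos, input.begin() + pos + chunk);
    pos += chunk;
  }
  return out;
}

// Walk the framed stream and return the largest declared decompressed
// block size -- the quantity the BlockSizeLimit test bounds by 256 KiB.
uint32_t MaxDeclaredBlockSize(const std::vector<uint8_t>& framed) {
  uint32_t max_size = 0;
  size_t pos = 0;
  while (pos + 8 <= framed.size()) {
    uint32_t decomp = GetBE32(&framed[pos]);
    uint32_t comp = GetBE32(&framed[pos + 4]);
    max_size = std::max(max_size, decomp);
    pos += 8 + comp;
  }
  return max_size;
}
```

With this framing, 320 KiB of input produces two blocks (256 KiB + 64 KiB), each small enough for Hadoop's fixed decompression buffer.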
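The `MaxCompressedLen` update must budget an 8-byte prefix per block plus LZ4's worst-case bound for each chunk. A minimal sketch of that arithmetic (helper names are hypothetical; `Lz4CompressBound` mirrors `LZ4_compressBound`'s documented formula `n + n/255 + 16`):

```cpp
#include <cstdint>

// Hypothetical constants mirroring the Hadoop codec's layout.
constexpr int64_t kBlockSize = 256 * 1024;  // max decompressed bytes per block
constexpr int64_t kPrefixLength = 8;        // two big-endian uint32 sizes

// Number of blocks needed to frame `input_len` bytes. Treating empty
// input as one (empty) block is an assumption of this sketch.
int64_t NumBlocks(int64_t input_len) {
  if (input_len == 0) return 1;
  return (input_len + kBlockSize - 1) / kBlockSize;
}

// LZ4's documented worst-case compressed size for n input bytes.
int64_t Lz4CompressBound(int64_t n) { return n + n / 255 + 16; }

// Worst-case framed output size: one prefix per block plus the LZ4
// bound for that block's chunk of the input.
int64_t MaxCompressedLen(int64_t input_len) {
  int64_t total = 0;
  int64_t remaining = input_len;
  do {
    int64_t chunk = remaining < kBlockSize ? remaining : kBlockSize;
    total += kPrefixLength + Lz4CompressBound(chunk);
    remaining -= chunk;
  } while (remaining > 0);
  return total;
}
```

For inputs ≤ 256 KiB this reduces to the single-block bound, matching the PR's claim of no behavioral change in that range.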
