clee704 opened a new pull request, #49642: URL: https://github.com/apache/arrow/pull/49642
### Rationale

`Lz4HadoopCodec::Compress` writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's `Lz4Decompressor` allocates a fixed 256 KiB output buffer per block (`IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT`), so any block whose decompressed size exceeds 256 KiB causes an `LZ4Exception` on JVM readers (parquet-mr + Hadoop `BlockDecompressorStream`).

This was found when writing Parquet dictionary pages >256 KiB with LZ4 compression (e.g. 40K unique INT64 values = 320 KiB). The file was written successfully, but the JVM reader could not decompress the dictionary page.

This is a **read failure, not data corruption**: the compressed bytes on disk are valid LZ4, and Arrow's own C++ reader decompresses them fine. The JVM reader throws a hard exception; it does not silently return wrong data.

PARQUET-1878 added the Hadoop-compatible codec but only writes one block per page. ARROW-11301 fixed the reader for multi-block Hadoop data, but the writer was never updated.

### What changes are included in this PR?

Split the input into blocks of ≤ 256 KiB in `Lz4HadoopCodec::Compress`, and update `MaxCompressedLen` to account for the per-block prefix overhead. Arrow's reader (`TryDecompressHadoop`) already handles multiple blocks. There is no behavioral change for data ≤ 256 KiB: it still produces a single block, with output identical to before.

### Are these changes tested?

Yes:
- `MultiBlockRoundtrip`: compress→decompress round-trip for sizes from 0 to 1 MiB.
- `BlockSizeLimit`: parses the compressed output and asserts that every block's `decompressed_size` is ≤ 256 KiB. **Fails without the fix, passes with it.**

### Are there any user-facing changes?

Parquet files written with `LZ4_HADOOP` compression containing pages >256 KiB will now be readable by JVM-based Parquet readers (parquet-mr + Hadoop); previously these files caused an `LZ4Exception` on the JVM reader. There is no change for files with pages ≤ 256 KiB.

Closes #49641.
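For context, the Hadoop framing described above prepends two big-endian 32-bit sizes (decompressed size, then compressed size) to each block. The sketch below uses hypothetical helper names and copies the payload verbatim in place of real LZ4 compression (so it stays self-contained); it shows how a writer can split input into ≤ 256 KiB frames and how a check like the `BlockSizeLimit` test can walk the frames and inspect the declared sizes:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hadoop framing: each block is
//   [4-byte BE decompressed size][4-byte BE compressed size][LZ4 data]
constexpr size_t kHadoopBlockSize = 256 * 1024;  // Lz4Decompressor buffer

void PutBE32(std::vector<uint8_t>* out, uint32_t v) {
  out->push_back(static_cast<uint8_t>(v >> 24));
  out->push_back(static_cast<uint8_t>(v >> 16));
  out->push_back(static_cast<uint8_t>(v >> 8));
  out->push_back(static_cast<uint8_t>(v));
}

uint32_t GetBE32(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Frame `input` into Hadoop blocks of at most kHadoopBlockSize each.
// A real codec would LZ4-compress each chunk; here the payload is
// copied verbatim so the framing logic is the only moving part.
std::vector<uint8_t> FrameHadoop(const std::vector<uint8_t>& input) {
  std::vector<uint8_t> out;
  size_t pos = 0;
  while (pos < input.size()) {
    size_t chunk = std::min(kHadoopBlockSize, input.size() - pos);
    PutBE32(&out, static_cast<uint32_t>(chunk));  // decompressed size
    PutBE32(&out, static_cast<uint32_t>(chunk));  // "compressed" size
    out.insert(out.end(), input.begin() + pos, input.begin() + pos + chunk);
    pos += chunk;
  }
  return out;
}

// Walk the framed stream and return the largest declared decompressed
// block size -- the quantity the BlockSizeLimit test bounds by 256 KiB.
uint32_t MaxDeclaredBlockSize(const std::vector<uint8_t>& framed) {
  uint32_t max_size = 0;
  size_t pos = 0;
  while (pos + 8 <= framed.size()) {
    uint32_t decomp = GetBE32(&framed[pos]);
    uint32_t comp = GetBE32(&framed[pos + 4]);
    max_size = std::max(max_size, decomp);
    pos += 8 + comp;
  }
  return max_size;
}
```

With this framing, 320 KiB of input produces two blocks (256 KiB + 64 KiB), each small enough for Hadoop's fixed decompression buffer.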
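The `MaxCompressedLen` update must budget an 8-byte prefix per block plus LZ4's worst-case bound for each chunk. A minimal sketch of that arithmetic (helper names are hypothetical; `Lz4CompressBound` mirrors `LZ4_compressBound`'s documented formula `n + n/255 + 16`):

```cpp
#include <cstdint>

// Hypothetical constants mirroring the Hadoop codec's layout.
constexpr int64_t kBlockSize = 256 * 1024;  // max decompressed bytes per block
constexpr int64_t kPrefixLength = 8;        // two big-endian uint32 sizes

// Number of blocks needed to frame `input_len` bytes. Treating empty
// input as one (empty) block is an assumption of this sketch.
int64_t NumBlocks(int64_t input_len) {
  if (input_len == 0) return 1;
  return (input_len + kBlockSize - 1) / kBlockSize;
}

// LZ4's documented worst-case compressed size for n input bytes.
int64_t Lz4CompressBound(int64_t n) { return n + n / 255 + 16; }

// Worst-case framed output size: one prefix per block plus the LZ4
// bound for that block's chunk of the input.
int64_t MaxCompressedLen(int64_t input_len) {
  int64_t total = 0;
  int64_t remaining = input_len;
  do {
    int64_t chunk = remaining < kBlockSize ? remaining : kBlockSize;
    total += kPrefixLength + Lz4CompressBound(chunk);
    remaining -= chunk;
  } while (remaining > 0);
  return total;
}
```

For inputs ≤ 256 KiB this reduces to the single-block bound, matching the PR's claim of no behavioral change in that range.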
