clee704 opened a new issue, #49641:
URL: https://github.com/apache/arrow/issues/49641

   ### Describe the bug
   
   `Lz4HadoopCodec::Compress` writes the entire input as a single Hadoop-framed 
LZ4 block regardless of size. Hadoop's `Lz4Decompressor` allocates a fixed 256 
KiB output buffer per block (`IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT = 256 
* 1024`), so any block whose decompressed size exceeds 256 KiB causes 
`LZ4Exception` on JVM readers (parquet-mr + Hadoop `BlockDecompressorStream`).
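
To check whether a given compressed page would hit this limit, one can walk the frame headers without decompressing anything. The following is a minimal sketch, assuming each frame is laid out as a 4-byte big-endian decompressed size, a 4-byte big-endian compressed size, and then the LZ4 payload. The helper names (`MaxDeclaredBlockSize`, `ExceedsHadoopBuffer`) are hypothetical and are not part of Arrow or Hadoop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hadoop's default LZ4 buffer size, as quoted in this issue.
constexpr int64_t kHadoopLz4BufferSize = 256 * 1024;

// Reads a 4-byte big-endian unsigned integer starting at p.
inline uint32_t ReadBigEndian32(const uint8_t* p) {
  assert(p != nullptr);
  return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Hypothetical helper: walks the assumed
// [decompressed size][compressed size][payload] frames and returns the
// largest declared decompressed size. Returns -1 if a header or payload is
// truncated. Only the headers are inspected; nothing is decompressed.
int64_t MaxDeclaredBlockSize(const std::vector<uint8_t>& buf) {
  int64_t max_size = 0;
  size_t pos = 0;
  while (pos < buf.size()) {
    if (buf.size() - pos < 8) return -1;  // incomplete frame header
    const uint32_t raw_size = ReadBigEndian32(buf.data() + pos);
    const uint32_t compressed_size = ReadBigEndian32(buf.data() + pos + 4);
    pos += 8;
    if (compressed_size > buf.size() - pos) return -1;  // payload runs past the end
    max_size = std::max<int64_t>(max_size, raw_size);
    pos += compressed_size;
  }
  return max_size;
}

// True when at least one frame declares more decompressed bytes than
// Hadoop's default output buffer can hold.
bool ExceedsHadoopBuffer(const std::vector<uint8_t>& buf) {
  return MaxDeclaredBlockSize(buf) > kHadoopLz4BufferSize;
}
```

A page written as a single 320 KiB block would make `ExceedsHadoopBuffer` return `true`; the same page split into 256 KiB and 64 KiB frames would not.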
   
   PARQUET-1878 added `Lz4HadoopCodec`, but the codec writes one block per page. 
ARROW-11301 fixed the *reader* for multi-block Hadoop data, but the *writer* 
was never updated to split large inputs the same way Hadoop's 
`BlockCompressorStream` does.
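
One possible shape for the writer-side fix is sketched below: split the input into chunks of at most 256 KiB and emit one frame per chunk. This is an illustration under stated assumptions, not Arrow's actual implementation. It assumes each frame carries its own 4-byte big-endian decompressed size and compressed size, and it hides the raw LZ4 call behind a hypothetical `BlockCompressor` callback so the framing logic can be read on its own. `CompressHadoopFramed` and `AppendBigEndian32` are likewise made-up names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical stand-in for a raw LZ4 block compressor: takes one chunk and
// returns its compressed bytes.
using BlockCompressor = std::function<Bytes(const uint8_t* data, size_t len)>;

// Hadoop's default LZ4 buffer size, as quoted in this issue.
constexpr size_t kHadoopLz4BufferSize = 256 * 1024;

// Appends v to out as a 4-byte big-endian integer.
inline void AppendBigEndian32(Bytes& out, uint32_t v) {
  out.push_back(static_cast<uint8_t>(v >> 24));
  out.push_back(static_cast<uint8_t>(v >> 16));
  out.push_back(static_cast<uint8_t>(v >> 8));
  out.push_back(static_cast<uint8_t>(v));
}

// Splits input into chunks of at most max_block_size bytes and writes one
// [decompressed size][compressed size][payload] frame per chunk (assumed
// layout). An empty input produces no frames in this sketch.
Bytes CompressHadoopFramed(const Bytes& input, const BlockCompressor& compress,
                           size_t max_block_size = kHadoopLz4BufferSize) {
  assert(max_block_size > 0);
  Bytes out;
  for (size_t offset = 0; offset < input.size(); offset += max_block_size) {
    const size_t chunk = std::min(max_block_size, input.size() - offset);
    const Bytes compressed = compress(input.data() + offset, chunk);
    AppendBigEndian32(out, static_cast<uint32_t>(chunk));
    AppendBigEndian32(out, static_cast<uint32_t>(compressed.size()));
    out.insert(out.end(), compressed.begin(), compressed.end());
  }
  return out;
}
```

With this framing, a 320 KiB dictionary page becomes two frames (256 KiB and 64 KiB), so no single block exceeds the reader's buffer. In a real codec the callback could wrap `LZ4_compress_default` from the LZ4 library; whether per-chunk headers or a single leading size header better matches `BlockCompressorStream` output is something to verify against Hadoop itself.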
   
   ### Steps to reproduce
   
   Write a Parquet file with `LZ4_HADOOP` compression containing a dictionary 
page larger than 256 KiB (e.g. 40 × 1024 unique INT64 values × 8 bytes = 
320 KiB), then read it with a JVM-based Parquet reader (parquet-mr + Hadoop).
   
   ### Expected behavior
   
   The file should be readable by JVM-based Parquet readers.
   
   ### Actual behavior
   
   ```
   net.jpountz.lz4.LZ4Exception: Error decoding offset 131193 of input buffer
     at net.jpountz.lz4.LZ4JNISafeDecompressor.decompress(LZ4JNISafeDecompressor.java:71)
     at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompressDirectBuf(Lz4Decompressor.java:278)
     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
     ...
   ```
   
   ### Severity
   
   Read failure, not data corruption. The bytes on disk are valid LZ4 — Arrow's 
own C++ reader handles them fine. The JVM reader throws a hard exception; it 
does not return wrong data.
   
   ### Component(s)
   
   C++
   
   ### Related issues
   
   ARROW-9177, PARQUET-1878, ARROW-11301

