Re: [PR] GH-49641: [C++] Fix Lz4HadoopCodec to split large blocks for Hadoop compatibility [arrow]

via GitHub Thu, 02 Apr 2026 18:25:34 -0700


clee704 commented on PR #49642:
URL: https://github.com/apache/arrow/pull/49642#issuecomment-4181298179


   Thanks for the review @pitrou!
   
   > I would not consider this a critical fix, as this is just working around a 
bug/limitation in another Parquet implementation.
   
   Fair point. I've removed the "Critical Fix" label from the description. 
It's an interoperability fix rather than a correctness issue in Arrow's own 
read path.
   
   > why not use the newer LZ4_RAW which completely solves the Hadoop 
compatibility problem?
   
   Good question. In our case the codec mapping is in a layer above Arrow (SNPW 
maps Spark's `lz4` config → `Compression::LZ4_HADOOP`), so switching to 
`LZ4_RAW` is a separate change there. We plan to do that too.
   
   That said, `Lz4HadoopCodec` exists specifically for Hadoop compatibility and 
is a public codec in Arrow, so it should produce output that Hadoop can actually 
read, regardless of whether callers should prefer `LZ4_RAW` for new files. The 
current behavior arguably violates the codec's own contract.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

