officialasishkumar opened a new pull request, #18465: URL: https://github.com/apache/hudi/pull/18465
### Describe the issue this Pull Request addresses

Closes #18450.

The native HFile writer used protobuf varint encoding (`CodedOutputStream.writeUInt32NoTag`) for key lengths in root and meta index blocks, while the reader used Hadoop `WritableUtils` VarLong decoding. These two variable-length integer encodings are incompatible for values >= 128:

- **Protobuf varint**: base-128, little-endian, with MSB continuation bits
- **Hadoop WritableUtils VarLong**: a header byte indicating the number of following big-endian value bytes

For keys with content length >= 126 characters (the varint value reaches 128 after adding the 2-byte row key length prefix), the reader misinterpreted the protobuf-encoded bytes as a Hadoop VarLong, producing a negative key length and causing a `NegativeArraySizeException`.

### Summary and Changelog

This change fixes the encoding mismatch by switching the HFile block index writers (`HFileRootIndexBlock` and `HFileMetaIndexBlock`) to Hadoop WritableUtils VarInt encoding, matching both the HBase HFile format and the existing reader logic.

- Added an `IOUtils.writeVarInt()` method that implements Hadoop-compatible variable-length integer encoding
- Updated `HFileRootIndexBlock.getUncompressedBlockDataToWrite()` to use `writeVarInt()` instead of protobuf's `getVariableLengthEncodedBytes()`
- Updated `HFileMetaIndexBlock.getUncompressedBlockDataToWrite()` similarly
- Added a `testLongKeys` test in `TestHFileWriter` that writes and reads HFile entries with 200+ character keys, directly exercising the fix
- Added `writeVarInt` round-trip and cross-validation tests in `TestIOUtils`

### Impact

- No public API changes
- Existing HFiles with key content < 126 characters are unaffected, as both encodings produce identical single-byte output for values 0-127
- HFiles with key content >= 126 characters were previously unreadable (they caused a `NegativeArraySizeException`) and are now handled correctly

### Risk Level

Low.
The encoding change only affects the write path for multi-byte varints (key content length >= 126 characters). For shorter keys, the encoding is byte-identical. The reader already uses Hadoop VarLong decoding and is unchanged. All existing tests pass, including HBase compatibility tests that read HBase-generated HFiles.

### Documentation Update

None.

### Contributor's checklist

- [x] Read through the [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
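The incompatibility described above can be reproduced with a minimal standalone sketch. This is not Hudi's actual `IOUtils` or `WritableUtils` code; the method names are illustrative, and the Hadoop-style encoder/decoder below are simplified reimplementations of the WritableUtils VarLong format for demonstration only:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class VarIntMismatchDemo {

    // Protobuf-style unsigned varint: base-128, little-endian,
    // MSB of each byte is a continuation bit.
    static byte[] protobufVarint(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Hadoop WritableUtils-style VarLong: one byte for -112..127,
    // otherwise a header byte encoding sign and the count of
    // following big-endian magnitude bytes.
    static byte[] hadoopWriteVarLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);
            return out.toByteArray();
        }
        boolean negative = value < 0;
        if (negative) {
            value ^= -1L;  // one's complement, so the magnitude is non-negative
        }
        int nBytes = 0;
        for (long tmp = value; tmp != 0; tmp >>>= 8) {
            nBytes++;
        }
        out.write((negative ? -120 : -112) - nBytes);  // header byte
        for (int i = nBytes - 1; i >= 0; i--) {
            out.write((int) (value >>> (8 * i)) & 0xFF);  // big-endian bytes
        }
        return out.toByteArray();
    }

    // Matching Hadoop-style decoder, as the HFile reader side uses.
    static long hadoopReadVarLong(byte[] buf) {
        byte first = buf[0];
        if (first >= -112) {
            return first;  // single-byte value
        }
        boolean negative = first < -120;
        int nBytes = negative ? (-120 - first) : (-112 - first);
        long value = 0;
        for (int i = 1; i <= nBytes; i++) {
            value = (value << 8) | (buf[i] & 0xFF);
        }
        return negative ? (value ^ -1L) : value;
    }

    public static void main(String[] args) {
        // For 0..127 both encodings emit the same single byte.
        System.out.println(Arrays.equals(protobufVarint(127), hadoopWriteVarLong(127)));  // true

        // At 128 they diverge: protobuf emits {0x80, 0x01}, Hadoop {0x8F, 0x80}.
        byte[] pb = protobufVarint(128);
        byte[] hd = hadoopWriteVarLong(128);
        System.out.printf("protobuf: %02X %02X, hadoop: %02X %02X%n",
            pb[0] & 0xFF, pb[1] & 0xFF, hd[0] & 0xFF, hd[1] & 0xFF);

        // Feeding the protobuf bytes to the Hadoop-style decoder: the 0x80
        // header byte claims 8 following big-endian bytes of a complemented
        // (negative) value, so the decoded "length" comes out negative -- the
        // root cause of the NegativeArraySizeException. (Buffer is padded so
        // the implied 8-byte read succeeds.)
        byte[] padded = new byte[9];
        System.arraycopy(pb, 0, padded, 0, pb.length);
        System.out.println(hadoopReadVarLong(padded) < 0);  // true
    }
}
```

This also illustrates why the fix is backward-compatible: both encodings are byte-identical for values 0-127, so only index entries whose encoded key length reaches 128 ever produced divergent bytes.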
