officialasishkumar opened a new pull request, #18465: URL: https://github.com/apache/hudi/pull/18465
### Describe the issue this Pull Request addresses

Closes #18450.

The native HFile writer used protobuf varint encoding (`CodedOutputStream.writeUInt32NoTag`) for key lengths in root and meta index blocks, while the reader used Hadoop `WritableUtils` VarLong decoding. These two variable-length integer encodings are incompatible for values >= 128:

- **Protobuf varint**: base-128, little-endian, with MSB continuation bits
- **Hadoop WritableUtils VarLong**: a header byte indicating the number of following big-endian value bytes

For keys with content length >= 126 characters (the varint value reaches 128 after adding the 2-byte row key length prefix), the reader misinterpreted the protobuf-encoded bytes as a Hadoop VarLong, producing a negative key length and causing a `NegativeArraySizeException`.

### Summary and Changelog

This change fixes the encoding mismatch by switching the HFile block index writers (`HFileRootIndexBlock` and `HFileMetaIndexBlock`) to Hadoop WritableUtils VarInt encoding, matching both the HBase HFile format and the existing reader logic.

- Added an `IOUtils.writeVarInt()` method that implements Hadoop-compatible variable-length integer encoding
- Updated `HFileRootIndexBlock.getUncompressedBlockDataToWrite()` to use `writeVarInt()` instead of protobuf's `getVariableLengthEncodedBytes()`
- Updated `HFileMetaIndexBlock.getUncompressedBlockDataToWrite()` similarly
- Added a `testLongKeys` test in `TestHFileWriter` that writes and reads HFile entries with 200+ character keys, directly exercising the fix
- Added `writeVarInt` round-trip and cross-validation tests in `TestIOUtils`

### Impact

- No public API changes
- Existing HFiles with key content < 126 characters are unaffected, as both encodings produce identical single-byte output for values 0-127
- HFiles with key content >= 126 characters were previously unreadable (they caused a `NegativeArraySizeException`) and are now handled correctly

### Risk Level

Low.
The encoding change only affects the write path for multi-byte varints (key content length >= 126 characters). For shorter keys, the encoding is byte-identical. The reader already uses Hadoop VarLong decoding and is unchanged. All existing tests pass, including HBase compatibility tests that read HBase-generated HFiles.

### Documentation Update

None.

### Contributor's checklist

- [x] Read through the [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
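The incompatibility described above can be reproduced with a minimal standalone sketch. This is not Hudi's actual `IOUtils` or `WritableUtils` code; the method names are illustrative, and the Hadoop-style encoder/decoder below are simplified reimplementations of the WritableUtils VarLong format for demonstration only:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class VarIntMismatchDemo {

    // Protobuf-style unsigned varint: base-128, little-endian,
    // MSB of each byte is a continuation bit.
    static byte[] protobufVarint(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Hadoop WritableUtils-style VarLong: one byte for -112..127,
    // otherwise a header byte encoding sign and the count of
    // following big-endian magnitude bytes.
    static byte[] hadoopWriteVarLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);
            return out.toByteArray();
        }
        boolean negative = value < 0;
        if (negative) {
            value ^= -1L;  // one's complement, so the magnitude is non-negative
        }
        int nBytes = 0;
        for (long tmp = value; tmp != 0; tmp >>>= 8) {
            nBytes++;
        }
        out.write((negative ? -120 : -112) - nBytes);  // header byte
        for (int i = nBytes - 1; i >= 0; i--) {
            out.write((int) (value >>> (8 * i)) & 0xFF);  // big-endian bytes
        }
        return out.toByteArray();
    }

    // Matching Hadoop-style decoder, as the HFile reader side uses.
    static long hadoopReadVarLong(byte[] buf) {
        byte first = buf[0];
        if (first >= -112) {
            return first;  // single-byte value
        }
        boolean negative = first < -120;
        int nBytes = negative ? (-120 - first) : (-112 - first);
        long value = 0;
        for (int i = 1; i <= nBytes; i++) {
            value = (value << 8) | (buf[i] & 0xFF);
        }
        return negative ? (value ^ -1L) : value;
    }

    public static void main(String[] args) {
        // For 0..127 both encodings emit the same single byte.
        System.out.println(Arrays.equals(protobufVarint(127), hadoopWriteVarLong(127)));  // true

        // At 128 they diverge: protobuf emits {0x80, 0x01}, Hadoop {0x8F, 0x80}.
        byte[] pb = protobufVarint(128);
        byte[] hd = hadoopWriteVarLong(128);
        System.out.printf("protobuf: %02X %02X, hadoop: %02X %02X%n",
            pb[0] & 0xFF, pb[1] & 0xFF, hd[0] & 0xFF, hd[1] & 0xFF);

        // Feeding the protobuf bytes to the Hadoop-style decoder: the 0x80
        // header byte claims 8 following big-endian bytes of a complemented
        // (negative) value, so the decoded "length" comes out negative -- the
        // root cause of the NegativeArraySizeException. (Buffer is padded so
        // the implied 8-byte read succeeds.)
        byte[] padded = new byte[9];
        System.arraycopy(pb, 0, padded, 0, pb.length);
        System.out.println(hadoopReadVarLong(padded) < 0);  // true
    }
}
```

This also illustrates why the fix is backward-compatible: both encodings are byte-identical for values 0-127, so only index entries whose encoded key length reaches 128 ever produced divergent bytes.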
