[I] [C++][Parquet] Refactor RLE decoding by extract a RLE parser [arrow]

via GitHub Tue, 15 Jul 2025 01:02:15 -0700


pitrou opened a new issue, #47112:
URL: https://github.com/apache/arrow/issues/47112

### Describe the enhancement requested

The Parquet [Bit-packed RLE
encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
is used in two performance-critical situations:
1. definition and repetition levels, which are present on many or even most
columns (when writing nullable Arrow data, for example, definition levels are
always generated even if there are no nulls)
2. dictionary indices for [Dictionary
encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)

Both these areas lack performance currently, and it has [been difficult to
optimize](github.com/apache/arrow/issues/40845).
The RLE decoder is currently spread over two facilities: the
[RleDecoder](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/rle_encoding_internal.h#L87-L88)
does the actual decoding to a linear array of integers, while the
[BitReader](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/bit_stream_utils_internal.h#L125-L128)
hosts some low-level reading routines.

This issue is to refactor the `RleDecoder` into a parser+decoder. The parser
would be a pure event-driven facility to decompose a RLE stream into its
individual tokens (. The parser does not need to use the `BitReader` class, it
can just reimplement the corresponding operations.

(an inspiration for event-driven parser APIs can be found in
[RapidJSON](https://rapidjson.org/md_doc_sax.html), though the RLE parser would
be much simpler as there are only two different kinds of tokens in the RLE
stream: literal runs and repeated runs)

This refactoring will probably not bring any performance benefits in itself
(though who knows?). However, once we have a RLE parser, we can then avoid the
decoder in some situations.

For example, a very common situation is OPTIONAL columns in Parquet, where
definition levels can have only two values: 0 and 1. Currently, we decode such
levels into 16-bit integers and then re-encode them into a bitmap. This is
wasteful and pointless, while we could generate the bitmap directly from the
RLE stream.

### Component(s)

C++, Parquet

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] [C++][Parquet] Refactor RLE decoding by extract a RLE parser [arrow]

Reply via email to