pitrou opened a new issue, #47112: URL: https://github.com/apache/arrow/issues/47112
### Describe the enhancement requested The Parquet [Bit-packed RLE encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) is used in two performance-critical situations: 1. definition and repetition levels, which are present on many or even most columns (when writing nullable Arrow data, for example, definition levels are always generated even if there are no nulls) 2. dictionary indices for [Dictionary encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) Both these areas lack performance currently, and it has [been difficult to optimize](github.com/apache/arrow/issues/40845). The RLE decoder is currently spread over two facilities: the [RleDecoder](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/rle_encoding_internal.h#L87-L88) does the actual decoding to a linear array of integers, while the [BitReader](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/bit_stream_utils_internal.h#L125-L128) hosts some low-level reading routines. This issue is to refactor the `RleDecoder` into a parser+decoder. The parser would be a pure event-driven facility to decompose a RLE stream into its individual tokens (. The parser does not need to use the `BitReader` class, it can just reimplement the corresponding operations. (an inspiration for event-driven parser APIs can be found in [RapidJSON](https://rapidjson.org/md_doc_sax.html), though the RLE parser would be much simpler as there are only two different kinds of tokens in the RLE stream: literal runs and repeated runs) This refactoring will probably not bring any performance benefits in itself (though who knows?). However, once we have a RLE parser, we can then avoid the decoder in some situations. For example, a very common situation is OPTIONAL columns in Parquet, where definition levels can have only two values: 0 and 1. Currently, we decode such levels into 16-bit integers and then re-encode them into a bitmap. This is wasteful and pointless, while we could generate the bitmap directly from the RLE stream. ### Component(s) C++, Parquet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
