pitrou opened a new issue, #47112:
URL: https://github.com/apache/arrow/issues/47112

   ### Describe the enhancement requested
   
   The Parquet [Bit-packed RLE 
encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
 is used in two performance-critical situations:
   1. definition and repetition levels, which are present on many or even most 
columns (when writing nullable Arrow data, for example, definition levels are 
always generated even if there are no nulls)
   2. dictionary indices for [Dictionary 
encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
   
   Both these areas lack performance currently, and it has [been difficult to 
optimize](github.com/apache/arrow/issues/40845).
   The RLE decoder is currently spread over two facilities: the 
[RleDecoder](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/rle_encoding_internal.h#L87-L88)
 does the actual decoding to a linear array of integers, while the 
[BitReader](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/bit_stream_utils_internal.h#L125-L128)
 hosts some low-level reading routines.
   
   This issue is to refactor the `RleDecoder` into a parser+decoder. The parser 
would be a pure event-driven facility to decompose a RLE stream into its 
individual tokens (. The parser does not need to use the `BitReader` class, it 
can just reimplement the corresponding operations.
   
   (an inspiration for event-driven parser APIs can be found in 
[RapidJSON](https://rapidjson.org/md_doc_sax.html), though the RLE parser would 
be much simpler as there are only two different kinds of tokens in the RLE 
stream: literal runs and repeated runs)
   
   This refactoring will probably not bring any performance benefits in itself 
(though who knows?). However, once we have a RLE parser, we can then avoid the 
decoder in some situations.
   
   For example, a very common situation is OPTIONAL columns in Parquet, where 
definition levels can have only two values: 0 and 1. Currently, we decode such 
levels into 16-bit integers and then re-encode them into a bitmap. This is 
wasteful and pointless, while we could generate the bitmap directly from the 
RLE stream.
   
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to