HippoBaro opened a new pull request, #9653: URL: https://github.com/apache/arrow-rs/pull/9653
# Which issue does this PR close?

- Closes #9652.

# Rationale for this change

See issue for details. The Parquet column writer currently does per-value work during level encoding regardless of data sparsity, even though the output encoding (RLE) is proportional to the number of runs.

# What changes are included in this PR?

Three incremental commits, each building on the previous:

1. **Fuse level encoding with counting and histogram updates.** `write_mini_batch()` previously made three separate passes over each level array: count non-nulls, update the level histogram, and RLE-encode. Now all three happen in a single pass via an observer callback on `LevelEncoder`. When the RLE encoder enters accumulation mode, the loop scans ahead for the full run length and batches the observer call. This makes counting and histogram updates O(1) per run.
2. **Batch consecutive null/empty rows in `write_list`.** Consecutive null or empty list entries are now collapsed into a single `visit_leaves()` call that bulk-extends all leaf level buffers, instead of one tree traversal per null row. Mirrors the approach already used by `write_struct()`.
3. **Short-circuit entirely-null columns.** When every element in an array is null, skip `Vec<i16>` level-buffer materialization entirely and store a compact `(def_value, rep_value, count)` tuple. The writer encodes this via `RleEncoder::put_n()` in O(1) amortized time, bypassing the normal mini-batch loop.

# Are these changes tested?

All tests passing.
I added some benchmarks to exercise the heavily sparse (99% null) and all-null code paths, alongside the existing 25% sparseness benchmarks:

```
Name                                   Before     After      Delta
primitive_all_null/default             37.5 ms    0.20 ms    (−99.5%)
primitive_all_null/zstd                37.1 ms    0.30 ms    (−99.2%)
primitive_sparse_99pct_null/default    42.5 ms    15.7 ms    (−62.9%)
primitive_sparse_99pct_null/p2         42.4 ms    15.9 ms    (−62.4%)
list_prim_sparse_99pct_null/default    40.8 ms    11.2 ms    (−72.4%)
list_prim_sparse_99pct_null/p2         40.8 ms    10.7 ms    (−73.8%)
bool/default                           12.7 ms    10.3 ms    (−18.7%)
primitive/default                      124.1 ms   104.6 ms   (−15.6%)
string_and_binary_view/default         46.3 ms    41.6 ms    (−10.1%)
list_primitive/default                 253.9 ms   235.3 ms   (−7.4%)
string_dictionary/default              46.2 ms    43.8 ms    (−5.3%)
```

Non-nullable column benchmarks are within noise, as expected, since they have no definition levels to optimize.

# Are there any user-facing changes?

None.
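
For readers unfamiliar with the run-based idea behind the first and third commits, here is a minimal, self-contained sketch. It is not the code in this PR and does not use the `parquet` crate: `LevelStats`, `for_each_run`, `Levels`, and `collect_stats` are hypothetical names made up for illustration, and the real `LevelEncoder` observer and `RleEncoder::put_n()` may look quite different. The sketch assumes a value counts as non-null when its definition level equals the maximum level. It shows statistics being updated once per run rather than once per value, and an all-null column being represented as a single `(value, count)` pair instead of a materialized `Vec<i16>`.

```rust
/// Hypothetical statistics accumulator (illustration only, not an arrow-rs type).
#[derive(Debug)]
struct LevelStats {
    max_level: i16,
    non_null: usize,
    histogram: Vec<u64>,
}

impl LevelStats {
    fn new(max_level: i16) -> Self {
        Self {
            max_level,
            non_null: 0,
            histogram: vec![0; max_level as usize + 1],
        }
    }

    /// Record `count` repetitions of `level` with O(1) work.
    fn observe(&mut self, level: i16, count: usize) {
        self.histogram[level as usize] += count as u64;
        // Assumption: a value is non-null when its level equals the maximum.
        if level == self.max_level {
            self.non_null += count;
        }
    }
}

/// Walk `levels` one run at a time and call `observer(value, run_len)` once per run.
fn for_each_run(levels: &[i16], mut observer: impl FnMut(i16, usize)) {
    let mut i = 0;
    while i < levels.len() {
        let value = levels[i];
        // Scan ahead to find the full length of the current run.
        let run_len = levels[i..].iter().take_while(|&&v| v == value).count();
        observer(value, run_len);
        i += run_len;
    }
}

/// Hypothetical level storage: either a full buffer or one repeated value.
enum Levels {
    Materialized(Vec<i16>),
    Uniform { value: i16, count: usize },
}

/// Return the statistics and the list of `(value, run_len)` runs for `levels`.
fn collect_stats(levels: &Levels, max_level: i16) -> (LevelStats, Vec<(i16, usize)>) {
    let mut stats = LevelStats::new(max_level);
    let mut runs = Vec::new();
    match levels {
        Levels::Materialized(buf) => for_each_run(buf, |value, len| {
            stats.observe(value, len);
            runs.push((value, len));
        }),
        // The uniform case needs no buffer and no scan: it is a single run.
        Levels::Uniform { value, count } => {
            if *count > 0 {
                stats.observe(*value, *count);
                runs.push((*value, *count));
            }
        }
    }
    (stats, runs)
}
```

Calling `collect_stats(&Levels::Uniform { value: 0, count: 1_000_000 }, 1)` performs a single `observe` call, whereas materializing a million zeros first would require allocating the buffer and scanning it. The trade-off in this sketch is that the scan-ahead in `for_each_run` still touches every value once; the saving comes from doing the statistics work per run instead of per value.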
