(arrow-site) 01/02: Updates to section 2.1

alamb Fri, 05 Dec 2025 03:05:16 -0800

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch alamb/sec2
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


commit 10f28d7741ce31c3f5fe1ffa55d0f43206bcae32
Author: Andrew Lamb <[email protected]>
AuthorDate: Fri Dec 5 06:01:26 2025 -0500

    Updates to section 2.1
---
 ...12-03-parquet-late-materialization-deep-dive.md | 25 ++++++++++++++--------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/_posts/2025-12-03-parquet-late-materialization-deep-dive.md 
b/_posts/2025-12-03-parquet-late-materialization-deep-dive.md
index 74f4d18befd..919bd68b033 100644
--- a/_posts/2025-12-03-parquet-late-materialization-deep-dive.md
+++ b/_posts/2025-12-03-parquet-late-materialization-deep-dive.md
@@ -56,22 +56,29 @@ The rest of this post zooms into how the code makes this 
path work.
 
 "LM-pipelined" might sound like something from a textbook. In `arrow-rs`, it 
simply refers to a pipeline that runs sequentially: "read predicate column → 
generate row selection → read data column". This contrasts with a **parallel** 
strategy, where all predicate columns are read simultaneously. While 
parallelism can maximize multi-core CPU usage, the pipelined approach is often 
superior in columnar storage because each filtering step drastically reduces 
the amount of data subsequent step [...]
 
-To achieve this, we defined these core roles:
+The code is structured into a few core roles:
 
 -   
**[ReadPlan](https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/read_plan.rs#L302)
 / 
[ReadPlanBuilder](https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/read_plan.rs#L34)**:
 Encodes "which columns to read and with what row subset" into a plan. It does 
not pre-read all predicate columns. It reads one, tightens the selection, and 
then moves on.
--   
**[RowSelection](https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/selection.rs#L139)**:
 Describes "skip/select N rows" via RLE (`RowSelector::select/skip`) or a 
bitmask. This is the core mechanism that carries sparsity through the pipeline.
--   
**[ArrayReader](https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/array_reader/mod.rs#L85)**:
 The component responsible for I/O and decoding. It receives a `RowSelection` 
and decides which pages to read and which values to decode.
+-   
**[RowSelection](https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/selection.rs#L139)**:
 Describes "skip/select N rows" using either [Run-length encoding] (RLE) 
(called a [`RowSelector`]) or a bitmask. This is the core mechanism that 
carries sparsity through the pipeline.
+-   
**[ArrayReader](https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/array_reader/mod.rs#L85)**:
 Responsible for decoding. It receives a `RowSelection` and decides which pages 
to read and which values to decode.
 
-> `RowSelection` can switch dynamically between RLE (selectors) and bitmasks. 
Bitmasks are faster when gaps are tiny and sparsity is high; RLE is friendlier 
to large, page-level skips. Details on this trade-off appear in section 3.1.
+[Run-length encoding]: https://en.wikipedia.org/wiki/Run-length_encoding
+[`RowSelector`]: 
https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/selection.rs#L66
 
-Consider a query with two filters: `SELECT * FROM table WHERE A > 10 AND B < 
5`:
+`RowSelection` can switch dynamically between RLE and bitmasks. Bitmasks are 
faster when gaps are tiny and sparsity is high; RLE is friendlier to large, 
page-level skips. Details on this trade-off appear in section 3.1.
+
+Consider again the query: `SELECT B, C  FROM table WHERE A > 10 AND B < 5`:
 
 1.  **Initial**: `selection = None` (equivalent to "select all").
-2.  **Read A**: `ArrayReader` decodes column A in batches; the predicate 
builds a boolean mask; `RowSelection::from_filters` turns it into a sparse 
selection.
-3.  **Tighten**: `ReadPlanBuilder::with_predicate` chains the new mask via 
`RowSelection::and_then`.
+2.  **Read A**: `ArrayReader` decodes column A in batches; the predicate 
builds a boolean mask; [`RowSelection::from_filters`] turns it into a sparse 
selection.
+3.  **Tighten**: [`ReadPlanBuilder::with_predicate`] chains the new mask via 
[`RowSelection::and_then`].
 4.  **Read B**: Build column B's reader with the current `selection`; the 
reader only performs I/O and decode for selected rows, producing an even 
sparser mask.
 5.  **Merge**: `selection = selection.and_then(selection_b)`; projection 
columns now decode a tiny row set.
 
+[`RowSelection::from_filters`]: 
https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/selection.rs#L149
+[`ReadPlanBuilder::with_predicate`]: 
https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/read_plan.rs#L143
+[`RowSelection::and_then`]: 
https://github.com/apache/arrow-rs/blob/bab30ae3d61509aa8c73db33010844d440226af2/parquet/src/arrow/arrow_reader/selection.rs#L345
+
 **Code locations and sketch**:
 
 ```rust
@@ -91,13 +98,13 @@ let plan = builder.build();
 let reader = ParquetRecordBatchReader::new(array_reader, plan);
 ```
 
-I've drawn a simple flowchart to help you understand:
+I've drawn a simple flowchart that illustrates this flow to help you 
understand:
 
 <figure style="text-align: center;">
   <img src="{{ site.baseurl }}/img/late-materialization/fig2.jpg" 
alt="Predicate-first pipeline flow" width="100%" class="img-responsive">
 </figure>
 
-Once the pipeline exists, the next question is **how to represent and combine 
these sparse selections** (the **Row Mask** in the diagram), which is where 
`RowSelection` comes in.
+Now that you understand how this pipeline works, the next question is **how to 
represent and combine these sparse selections** (the **Row Mask** in the 
diagram), which is where `RowSelection` comes in.
 
 ### 2.2 Logical ops on row selectors (`RowSelection::and_then`)

(arrow-site) 01/02: Updates to section 2.1

Reply via email to