alamb commented on code in PR #740:
URL: https://github.com/apache/arrow-site/pull/740#discussion_r2599070333
##########
_posts/2025-12-03-parquet-late-materialization-deep-dive.md:
##########
@@ -216,12 +216,18 @@ let plan_builder = override_selector_strategy_if_needed(
 assert_eq!(plan_builder.row_selection_policy(), &RowSelectionPolicy::Selectors);
 ```
-### 3.2 Page Level Pruning
+### 3.2 Page Pruning
-The ultimate performance win is **not reading the disk at all**. In the real world (especially with object storage), firing off a million tiny read requests is a **performance killer**. Since we have the Page Index, `arrow-rs` calculates exactly which Pages contain data we actually need. Even if the underlying storage client merges adjacent requests, the real win is CPU: **we completely skip the heavy lifting of decompressing and decoding pruned pages.**
+The ultimate performance win is **not doing I/O or decoding at all**. In the real world (especially with object storage), firing off a million tiny read requests is a **performance killer**. `arrow-rs` uses the Parquet [PageIndex] to calculate exactly which pages contain data we actually need. For very selective predicates, skipping pages can result in substantial I/O savings, even if the underlying storage client merges adjacent range requests. Another major win is reduced CPU: **we completely skip the heavy lifting of decompressing and decoding entirely pruned pages.**
-* **The Catch**: If `RowSelection` selects even a **single row** in a Page, the whole Page has to be decompressed and decoded.
-* **Implementation**: `scan_ranges` crunches the numbers using each page's metadata (`first_row_index` and `compressed_page_size`) to figure out which ranges are total skips, returning only the essential `(offset, length)` list. The decoder then cleans up the rest using `skip_records` inside the page.
+[PageIndex]: https://parquet.apache.org/docs/file-format/pageindex/
+
+* **The Catch**: If the `RowSelection` selects even a **single row** from a page, the whole page must be decompressed and decoded.
Review Comment:
np -- feel free to update however you think best
