This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new b8a2c1ad9e [parquet] Avoid a clone while resolving the read strategy (#9056)
b8a2c1ad9e is described below

commit b8a2c1ad9ea7a1b59350735ef3c52e6397406768
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Jan 5 13:35:31 2026 -0500

    [parquet] Avoid a clone while resolving the read strategy (#9056)
    
    # Which issue does this PR close?
    
    - related to https://github.com/apache/datafusion/pull/19477
    
    # Rationale for this change
    
    While working on https://github.com/apache/datafusion/pull/19477 and
    profiling ClickBench q7, I noticed that the RowSelectors were being
    cloned to resolve the strategy. For a large number of selections this
    is expensive and shows up in the traces:
    
    <img width="1724" height="1074" alt="Screenshot 2025-12-28 at 4 49 49 PM"
    src="https://github.com/user-attachments/assets/72c6fd22-9377-48ef-ba80-6bc03b177cf7"
    />
    
    
    ```shell
    samply record -- ./datafusion-cli-alamb_enable_pushdown -f q.sql > /dev/null 2>&
    ```
    
    We should change the code to avoid cloning the RowSelectors when
    resolving the strategy.
    
    # Changes
    
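    Don't clone or allocate while resolving the strategy: the total row
    count and the number of non-empty selectors are now computed in a
    single pass over the existing selection.
    
    I don't expect this to have a massive impact, but it did show up in the
    profile.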
    
    FYI @hhhizzz -- perhaps you could review this PR
    
    
    # Are these changes tested?
    
    Yes, by CI.
    
    # Are there any user-facing changes?
    Small performance improvement.
---
 parquet/src/arrow/arrow_reader/read_plan.rs | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/parquet/src/arrow/arrow_reader/read_plan.rs b/parquet/src/arrow/arrow_reader/read_plan.rs
index 3c17a358f0..7c9eb36bef 100644
--- a/parquet/src/arrow/arrow_reader/read_plan.rs
+++ b/parquet/src/arrow/arrow_reader/read_plan.rs
@@ -110,19 +110,22 @@ impl ReadPlanBuilder {
                     None => return RowSelectionStrategy::Selectors,
                 };
 
-                let trimmed = selection.clone().trim();
-                let selectors: Vec<RowSelector> = trimmed.into();
-                if selectors.is_empty() {
-                    return RowSelectionStrategy::Mask;
-                }
-
-                let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
-                let selector_count = selectors.len();
-                if selector_count == 0 {
+                // total_rows: total number of rows selected / skipped
+                // effective_count: number of non-empty selectors
+                let (total_rows, effective_count) =
+                    selection.iter().fold((0usize, 0usize), |(rows, count), s| {
+                        if s.row_count > 0 {
+                            (rows + s.row_count, count + 1)
+                        } else {
+                            (rows, count)
+                        }
+                    });
+
+                if effective_count == 0 {
                     return RowSelectionStrategy::Mask;
                 }
 
-                if total_rows < selector_count.saturating_mul(threshold) {
+                if total_rows < effective_count.saturating_mul(threshold) {
                     RowSelectionStrategy::Mask
                 } else {
                     RowSelectionStrategy::Selectors
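
The new logic can be tried outside the crate with a minimal sketch. The `RowSelector` and `RowSelectionStrategy` types below are simplified stand-ins rather than the parquet crate's definitions, and `resolve_strategy` is a hypothetical helper name. Only the fold and the threshold comparison are taken from the diff above.

```rust
// Simplified stand-in for the parquet crate's selector type (assumption:
// only the fields needed for this sketch are modelled).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

#[allow(dead_code)]
impl RowSelector {
    fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }

    fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }
}

// Simplified stand-in for the strategy enum used in the diff.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum RowSelectionStrategy {
    Mask,
    Selectors,
}

/// Hypothetical helper: picks a strategy by borrowing the selectors,
/// without cloning them or allocating an intermediate Vec.
#[allow(dead_code)]
fn resolve_strategy(selectors: &[RowSelector], threshold: usize) -> RowSelectionStrategy {
    // total_rows: rows covered by non-empty selectors (selected or skipped)
    // effective_count: number of non-empty selectors
    let (total_rows, effective_count) =
        selectors.iter().fold((0usize, 0usize), |(rows, count), s| {
            if s.row_count > 0 {
                (rows + s.row_count, count + 1)
            } else {
                (rows, count)
            }
        });

    if effective_count == 0 {
        return RowSelectionStrategy::Mask;
    }

    // Equivalent to "average run length < threshold", written without a
    // division; saturating_mul guards against overflow.
    if total_rows < effective_count.saturating_mul(threshold) {
        RowSelectionStrategy::Mask
    } else {
        RowSelectionStrategy::Selectors
    }
}
```

In this sketch, many short runs push the average run length below the threshold and yield `Mask`, while a few long runs yield `Selectors`. Zero-length selectors contribute to neither the row total nor the count.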
