[I] Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection alias [datafusion]

via GitHub Sat, 27 Jun 2026 05:56:51 -0700


ProjectYht opened a new issue, #23219:
URL: https://github.com/apache/datafusion/issues/23219


   ### Describe the bug
   
   ## Title
   
   Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection
   alias with statistics collection enabled
   
   ## Description
   
   I hit a panic in DataFusion 54.0.0 when querying a Parquet dataset from S3
   and ordering by a computed projection alias.
   
   The query is valid SQL and runs successfully in DuckDB on the same Parquet
   file. In DataFusion CLI, it panics during statistics handling.
   
   ## Environment
   
   - DataFusion version: 54.0.0
   - Interface: DataFusion CLI and Rust API
   - Source: Parquet files on S3
   - Dataset layout: Hive-style path segment, e.g.
   
   s3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/
   
   The query does not reference partition_col.
   
   The physical Parquet schema contains an id column.
   
   ## Minimal SQL
   
   CREATE EXTERNAL TABLE profile
   STORED AS PARQUET
   LOCATION 's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/';
   
   SELECT
     (((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
   FROM profile
   ORDER BY computed_bucket, CAST(id AS BIGINT)
   LIMIT 10;
   
   ## Observed panic
   
   thread 'main' panicked at .../datafusion-datasource-54.0.0/src/statistics.rs:100:48:
   index out of bounds: the len is 0 but the index is 0
   
   The relevant source location appears to be:
   
   // datafusion-datasource-54.0.0/src/statistics.rs
   if i < s.column_statistics.len() {
       ...
   } else {
       let partition_value = &pv[i - s.column_statistics.len()];
       ...
   }
   
   It looks like this path assumes that when a sort/statistics column index is
   not less than column_statistics.len(), the column must be a partition column,
   and it therefore indexes into partition_values. In this case partition_values
   is empty, causing the panic.
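
A minimal sketch of the suspected failure mode follows. It uses hypothetical stand-in names (`FileStats`, `partition_slot`), not DataFusion's real types, and it assumes one combination that is consistent with the panic message: no usable column statistics, no partition values, and a sort column at index 0. Other combinations where the index equals `column_statistics.len()` would produce the same message.

```rust
/// Hypothetical stand-in for per-file statistics; not DataFusion's real type.
struct FileStats {
    /// (min, max) per physical file column.
    column_statistics: Vec<(i64, i64)>,
}

/// Mirrors the shape of the quoted branch: any index that is not a file
/// column is assumed to refer to a partition value at `i - len`.
fn partition_slot(stats: &FileStats, i: usize) -> Option<usize> {
    let len = stats.column_statistics.len();
    if i < len {
        None // regular file column, no partition lookup needed
    } else {
        Some(i - len) // assumed to be a valid partition_values index
    }
}

fn main() {
    // Assumed scenario: empty statistics vector and no partition values.
    let stats = FileStats { column_statistics: vec![] };
    let partition_values: Vec<String> = vec![];

    let slot = partition_slot(&stats, 0).expect("index 0 falls in the else branch");
    assert_eq!(slot, 0);

    // `&partition_values[slot]` would panic here with
    // "index out of bounds: the len is 0 but the index is 0".
    // `get` shows the same condition without panicking.
    assert!(partition_values.get(slot).is_none());
}
```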
   
   ## Additional observation
   
   The physical id column does have Parquet min/max statistics. I checked with
   DuckDB metadata:
   
   SELECT
     count(*) AS row_groups,
     count(*) FILTER (WHERE stats_min IS NULL OR stats_max IS NULL) AS missing_minmax,
     count(*) FILTER (WHERE stats_min IS NOT NULL AND stats_max IS NOT NULL) AS has_minmax
   FROM parquet_metadata('s3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/*.parquet')
   WHERE path_in_schema = 'id';
   
   Result:
   
   row_groups = 75
   missing_minmax = 0
   has_minmax = 75
   
   So this does not appear to be caused by missing min/max statistics for id.
   
   ## DuckDB comparison
   
   The same query shape works in DuckDB:
   
   SELECT
     (((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
   FROM 's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/<redacted-file>.parquet'
   ORDER BY computed_bucket, CAST(id AS BIGINT)
   LIMIT 10;
   
   ## Expected behavior
   
   DataFusion should not panic.
   
   If statistics cannot be derived for the computed expression computed_bucket,
   I would expect DataFusion to fall back to unknown/absent column statistics
   and continue executing the query.
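
The fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposed patch: `MinMax` and `stats_for_column` are hypothetical names, statistics are reduced to an optional (min, max) pair, and partition values are modeled as plain integers. The only point is that a bounds-checked lookup returns "unknown" where the current code panics.

```rust
/// Hypothetical, simplified statistic: `None` means "unknown/absent".
type MinMax = Option<(i64, i64)>;

/// Resolve statistics for sort column `i` without indexing out of bounds.
fn stats_for_column(
    i: usize,
    column_statistics: &[MinMax],
    partition_values: &[i64],
) -> MinMax {
    // Regular file column: use whatever statistics are recorded for it.
    if let Some(stat) = column_statistics.get(i) {
        return *stat;
    }
    // `get(i)` returned None, so i >= len and this subtraction cannot underflow.
    let slot = i - column_statistics.len();
    // Treat the column as a partition column only if such a value exists;
    // otherwise report unknown statistics instead of panicking.
    partition_values.get(slot).map(|v| (*v, *v))
}

fn main() {
    let file_cols: [MinMax; 1] = [Some((1, 99))];

    // File column with known statistics.
    assert_eq!(stats_for_column(0, &file_cols, &[]), Some((1, 99)));
    // Partition column: min and max are both the partition value.
    assert_eq!(stats_for_column(1, &file_cols, &[7]), Some((7, 7)));
    // Neither a file column nor a partition column: unknown, no panic.
    assert_eq!(stats_for_column(1, &file_cols, &[]), None);
    assert_eq!(stats_for_column(0, &[], &[]), None);
}
```

With unknown statistics returned for the computed sort key, a planner would simply skip any statistics-based ordering optimization for that column and sort normally.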
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


