ProjectYht opened a new issue, #23219:
URL: https://github.com/apache/datafusion/issues/23219
### Describe the bug
## Title
Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection
alias with statistics collection enabled
## Description
I hit a panic in DataFusion 54.0.0 when querying a Parquet dataset from S3
and ordering by a computed projection alias.
The query is valid SQL and runs successfully in DuckDB on the same Parquet
file. In DataFusion CLI, it panics during
statistics handling.
## Environment
- DataFusion version: 54.0.0
- Interface: DataFusion CLI and Rust API
- Source: Parquet files on S3
- Dataset layout: Hive-style path segment, e.g.
s3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/
The query does not reference partition_col.
The physical Parquet schema contains an id column.
## Minimal SQL
CREATE EXTERNAL TABLE profile
STORED AS PARQUET
LOCATION
's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/';
SELECT
(((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM profile
ORDER BY computed_bucket, CAST(id AS BIGINT)
LIMIT 10;
## Observed panic
thread 'main' panicked at
.../datafusion-datasource-54.0.0/src/statistics.rs:100:48:
index out of bounds: the len is 0 but the index is 0
The relevant source location appears to be:
// datafusion-datasource-54.0.0/src/statistics.rs
if i < s.column_statistics.len() {
...
} else {
let partition_value = &pv[i - s.column_statistics.len()];
...
}
It looks like this path assumes that when a sort/statistics column index
exceeds column_statistics.len(), the column must be
a partition column and therefore indexes into partition_values. In this case
partition_values is empty, causing the panic.
## Additional observation
The physical id column does have Parquet min/max statistics. I checked with
DuckDB metadata:
SELECT
count(*) AS row_groups,
count(*) FILTER (WHERE stats_min IS NULL OR stats_max IS NULL) AS
missing_minmax,
count(*) FILTER (WHERE stats_min IS NOT NULL AND stats_max IS NOT NULL) AS
has_minmax
FROM
parquet_metadata('s3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/*.parquet')
WHERE path_in_schema = 'id';
Result:
row_groups = 75
missing_minmax = 0
has_minmax = 75
So this does not appear to be caused by missing min/max statistics for id.
## DuckDB comparison
The same query shape works in DuckDB:
SELECT
(((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM
's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/<redacted-file>.parquet'
ORDER BY computed_bucket, CAST(id AS BIGINT)
LIMIT 10;
## Expected behavior
DataFusion should not panic.
If statistics cannot be derived for the computed expression computed_bucket,
I would expect DataFusion to fall back to
unknown/absent column statistics and continue executing the query.
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]