alamb commented on PR #16659:
URL: https://github.com/apache/datafusion/pull/16659#issuecomment-3028490668
This is a somewhat subtle issue so I will try and summarize:
In DataFusion 47 and earlier:
1. Calling `DataFrame::register_parquet` collected statistics for the table
at create time (slower to create the table, potentially faster queries)
2. Calling `CREATE EXTERNAL TABLE` did not collect statistics (faster to
create the table, but potentially slower queries)
There are more details about this on the ticket from @davisp here:
- https://github.com/apache/datafusion/issues/15908
In DataFusion 48.0.0:
1. https://github.com/apache/datafusion/pull/16080 changed the behavior so
that both `DataFrame::register_parquet` and `CREATE EXTERNAL TABLE` did
**NOT** collect statistics.
However, this meant that users who were relying on statistics, such as
@AdamGS, saw queries get slower (see
https://github.com/apache/datafusion/issues/16444).
Thus this PR proposes changing DataFusion `48.0.1` so that:
1. Both `DataFrame::register_parquet` and `CREATE EXTERNAL TABLE` **WILL**
collect statistics.
Note that this is consistent with the behavior on the latest `main` (what
will be released in DataFusion 49.0.0):
- https://github.com/apache/datafusion/issues/16158
Since we have already made this change on `main`, the thinking is that by
changing 48.0.1 we'll avoid people fully migrating to the "no default
statistics" behavior only to have to change back again in 49.
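
For anyone who wants to pin the behavior explicitly rather than depend on a given version's default, a minimal sketch is below. It assumes the `datafusion.execution.collect_statistics` configuration option is what controls statistics collection at table creation time, and that it is set before the table is created. The table name and file path are hypothetical.

```sql
-- Assumption: this option controls whether statistics are gathered
-- when a table is created, so it is set before CREATE EXTERNAL TABLE.
SET datafusion.execution.collect_statistics = true;

-- Hypothetical table name and path, for illustration only.
CREATE EXTERNAL TABLE example_table
STORED AS PARQUET
LOCATION 'path/to/data.parquet';
```

Setting the option to `false` instead should give the faster-to-create, no-statistics behavior described above for 48.0.0.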