adriangb opened a new pull request, #21965: URL: https://github.com/apache/datafusion/pull/21965
## Which issue does this PR close?

- Closes #.

## Rationale for this change

When you point `CREATE EXTERNAL TABLE` at an empty directory (or one that does not exist yet) without specifying an explicit column list, DataFusion silently creates a table with **0 columns**. Any query against that table then fails with a confusing "column not found" / "no such column" error that gives no hint that the underlying issue is actually that schema inference had nothing to look at.

This is the same root cause as the discussion on https://github.com/apache/datafusion/pull/21806#issuecomment-4355371528. That thread covered it from the angle of benchmark runners hitting it, but the confusion is not specific to benchmarks. Failing at `CREATE EXTERNAL TABLE` time with a clear, actionable message seemed like the right fix overall.

## What changes are included in this PR?

`ListingOptions::infer_schema` now returns a `Plan` error when the location yields no files (after the existing 0-byte filter), telling the user to either add data files or declare an explicit schema:

```
Error during planning: No files found at file:///tmp/empty_dir/. Cannot infer schema from an empty location; either add data files or declare an explicit schema for the table.
```

Pre-declaring an empty table with an explicit schema (e.g. `CREATE EXTERNAL TABLE t(x int) STORED AS PARQUET LOCATION '...'` for later `INSERT`) still works; the inference path is only triggered when no schema is provided.

## Are these changes tested?

Yes. New cases in `datafusion/sqllogictest/test_files/ddl.slt` cover:

- Parquet, CSV, and JSON over an empty location without an explicit schema → all return the new `Plan` error.
- An empty location with an explicit schema → still works and queries cleanly.
- Schema inference still succeeds once files exist at the location, so the new check does not regress the happy path.

## Are there any user-facing changes?

Yes: `CREATE EXTERNAL TABLE ...
LOCATION '<empty-dir>'` without an explicit schema now errors at planning time instead of creating a 0-column table. Anyone relying on the previous behavior must add an explicit schema declaration. The error message tells them how.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
