mbutrovich commented on PR #22026: URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4390069611
> Thanks @mbutrovich I'll do a coder review later today/tomorrow , but another high level question is on the schema column name collision between user columns and virtual columns? how it can affect already existing user queries? Current DataFusion behavior would delegate that to figure out in #20135. Using partition columns as an example: DataFusion does not validate partition-vs-file-column collisions in the core code path. ListingTable::try_new (catalog-listing/src/table.rs:216) just pushes partition fields onto the file schema via SchemaBuilder::push, which arrow documents as "without checking for collision” (arrow-schema/src/schema.rs:48), and finish doesn't dedupe. Same in our TableSchema::build_table_schema. The only place collision is handled is the SQL-layer ListingTableFactory (listing_table_factory.rs:133), which for `CREATE EXTERNAL TABLE ... PARTITIONED BY (c1)` explicitly excludes c1 from the file schema before constructing options. It solves the problem by construction for the SQL entry point but doesn't guard the data structure itself. A caller constructing ListingTable/TableSchema programmatically with colliding names gets silent duplicates. So existing convention is: upstream avoids bad names, core structs don't validate. That said, I am not opposed to enforcing the contract if we want to make it explicit now that `with_virtual_columns` validates. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
