mbutrovich commented on PR #22026: URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4389588406
> I hate to bring this up but do you think think this approach is compatible with all of the proposed UX versions of this? I don't know that the UDF version is blessed necessarily, I want to make sure we don't back ourselves into a corner. I think the answer is yes but just want to confirm. Nah I think it's a good thing to sort out now from an API stability standpoint. It's a necessary discussion even if we're trying to carve out a small portion of the work to be a more bite-size PR now. I think the right lens is API stability: what does this PR add to the public surface that we'd have to live with (or break) if a later UX decision wants something different? The answer is a pretty small surface: * `TableSchema::with_virtual_columns(Vec<FieldRef>)` and `virtual_columns()` getter * `TableSchema::schema_without_virtual_columns()` * `ParquetVirtualColumn` enum in `datasource-parquet` The shape that's actually load-bearing across all of these is one sentence: **a virtual column is a leaf `FieldRef` carrying an Arrow extension type, held on `TableSchema` in its own bucket**. Everything else in the PR is either internal to the opener (schema augmentation during rewrite, projection-mask stripping, the pushdown defensive check) or a convention we can change without breaking callers (the `[file, partition, virtual]` ordering in `table_schema()`, which is just a projection away from any other layout). Mapping that against the UX directions in #20135: * UDF rewrite (`input_file_name()` shape, now applied to a reader-produced value): planner rewrites the UDF to a `Column` ref, TableProvider puts a `RowNumber`-tagged field in `virtual_columns`. Same seam as #20071. * `_metadata` struct column (Spark / Databricks): TableProvider exposes the struct at the SQL surface, projection pushdown reduces `_metadata.row_index` to a flat leaf, opener produces the leaf, a projection above the scan rebuilds the struct. * Relation-scoped hidden columns (`a.row_number`): TableProvider tags a hidden column with `RowNumber` and adds it to `virtual_columns` when projected. Hiding policy sits above. * @jkylling's extension-type tagging (`Hidden` + `RowNumber`): most direct fit, extension types are the mechanism already. All four land on the same seam. None of them would force us to break the `TableSchema` methods or the enum. The one case where we'd feel locked in is if a later design wants to model a "virtual column" that isn't backed by an arrow-rs extension type (a purely DataFusion-synthesized value, say). That would want a different seam. I'd argue that's a different concept anyway and shouldn't share the `virtual_columns` bucket, but worth noting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
