kosiew commented on code in PR #22360:
URL: https://github.com/apache/datafusion/pull/22360#discussion_r3303215096
##########
datafusion/datasource/src/mod.rs:
##########
@@ -163,12 +164,15 @@ pub struct PartitionedFile {
/// The estimated size of the parquet metadata, in bytes
pub metadata_size_hint: Option<usize>,
pub table_reference: Option<TableReference>,
+ /// A user-provided arrow schema for the file.
+ pub arrow_schema: Option<SchemaRef>,
Review Comment:
I think this needs a follow-up before merge. `PartitionedFile::arrow_schema`
introduces a new user-provided scan contract, but physical plan proto
serialization currently appears to drop it.
`datafusion/proto/src/physical_plan/to_proto.rs` builds
`protobuf::PartitionedFile` without this field, and
`datafusion/proto/proto/datafusion.proto` does not seem to have a schema field
for it.
As a result, a Parquet scan that is serialized and deserialized would lose
the explicit schema and fall back to parsing `ARROW:schema`, so the main
guarantee from this change would not hold end to end.
Could you please add this field to the proto model and conversions, plus a
roundtrip test showing that `PartitionedFile::with_arrow_schema(...)` survives
physical plan or `PartitionedFile` proto serialization?
##########
datafusion/datasource/src/mod.rs:
##########
@@ -163,12 +164,15 @@ pub struct PartitionedFile {
/// The estimated size of the parquet metadata, in bytes
pub metadata_size_hint: Option<usize>,
pub table_reference: Option<TableReference>,
+ /// A user-provided arrow schema for the file.
Review Comment:
Small doc suggestion: it would be helpful to make the public contract a bit
more precise here. My read is that this is the physical Arrow file schema used
by the Parquet opener, it should describe file columns rather than partition
columns, and it is currently ignored by non-Parquet sources.
Calling that out explicitly should help avoid users passing a table schema
that includes partitions, or expecting CSV and JSON readers to honor this field.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]