kosiew commented on code in PR #22360:
URL: https://github.com/apache/datafusion/pull/22360#discussion_r3303215096


##########
datafusion/datasource/src/mod.rs:
##########
@@ -163,12 +164,15 @@ pub struct PartitionedFile {
     /// The estimated size of the parquet metadata, in bytes
     pub metadata_size_hint: Option<usize>,
     pub table_reference: Option<TableReference>,
+    /// A user-provided arrow schema for the file.
+    pub arrow_schema: Option<SchemaRef>,

Review Comment:
   I think this needs a follow-up before merge. `PartitionedFile::arrow_schema` 
introduces a new user-provided scan contract, but physical plan proto 
serialization currently appears to drop it. 
`datafusion/proto/src/physical_plan/to_proto.rs` builds 
`protobuf::PartitionedFile` without this field, and 
`datafusion/proto/proto/datafusion.proto` does not seem to have a schema field 
for it.
   
   As a result, a Parquet scan that is serialized and deserialized would lose 
the explicit schema and fall back to parsing `ARROW:schema`, so the main 
guarantee from this change would not hold end to end.
   
   Could you please add this field to the proto model and conversions, plus a 
roundtrip test showing that `PartitionedFile::with_arrow_schema(...)` survives 
physical plan or `PartitionedFile` proto serialization?



##########
datafusion/datasource/src/mod.rs:
##########
@@ -163,12 +164,15 @@ pub struct PartitionedFile {
     /// The estimated size of the parquet metadata, in bytes
     pub metadata_size_hint: Option<usize>,
     pub table_reference: Option<TableReference>,
+    /// A user-provided arrow schema for the file.

Review Comment:
   Small doc suggestion: it would be helpful to make the public contract a bit 
more precise here. My read is that this is the physical Arrow file schema used 
by the Parquet opener, it should describe file columns rather than partition 
columns, and it is currently ignored by non-Parquet sources.
   
   Calling that out explicitly should help avoid users passing a table schema 
that includes partitions, or expecting CSV and JSON readers to honor this field.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to