syun64 commented on code in PR #902: URL: https://github.com/apache/iceberg-python/pull/902#discussion_r1672267800
##########
pyiceberg/table/__init__.py:
##########
@@ -1884,8 +1884,9 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
     from pyiceberg.io.pyarrow import project_batches, schema_to_pyarrow
+    target_schema = schema_to_pyarrow(self.projection())

Review Comment:
I'm less versed in Parquet-to-Arrow buffer conversion, so please do check me if this doesn't make sense 🙂 But are we actually casting the types on read? We decide whether to read with large or small types when instantiating the [fragment scanner](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1052C20-L1052C54), which loads the Parquet data into the Arrow buffers. The `schema_to_pyarrow()` calls for `pa.Table` or `pa.RecordBatchReader`, and in `to_requested_schema` after that, all represent the table schema in the same consistent (large or small) form, so they shouldn't result in any additional casting or reassignment of buffers.

I think the only time we cast the types is on write, where we may want to downcast them for forward compatibility. It looks like we have to choose a schema on write anyway, because giving the ParquetWriter a schema that isn't consistent with the dataframe's schema results in an [exception](https://github.com/apache/iceberg-python/pull/902/files#r1669524329).
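For illustration, here is a rough, self-contained sketch of both points using plain PyArrow rather than our code paths (the file path, field name, and schemas are made up, and I'm assuming the dataset scanner will honour a `large_string` schema the same way our fragment scanner does):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

path = "/tmp/large_small_types_example.parquet"  # illustrative path

# Write side: the ParquetWriter schema must match the table schema exactly,
# so downcasting large_string -> string has to happen before write_table().
table = pa.table({"name": pa.array(["a", "b"], type=pa.large_string())})
file_schema = pa.schema([pa.field("name", pa.string())])
with pq.ParquetWriter(path, file_schema) as writer:
    # writer.write_table(table)  # would raise: table schema does not match
    writer.write_table(table.cast(file_schema))  # downcast first, then write

# Read side: the large-vs-small decision is made when the scanner is built;
# the data is loaded into buffers of the requested flavour, so a downstream
# target schema in the same flavour implies no further casting.
parquet_format = ds.ParquetFileFormat()
fragment = parquet_format.make_fragment(path, filesystem=fs.LocalFileSystem())
large_schema = pa.schema([pa.field("name", pa.large_string())])
scanner = ds.Scanner.from_fragment(fragment=fragment, schema=large_schema)
print(scanner.to_table().schema)  # expected: name: large_string
```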