syun64 commented on code in PR #902: URL: https://github.com/apache/iceberg-python/pull/902#discussion_r1672267800
##########
pyiceberg/table/__init__.py:
##########
@@ -1884,8 +1884,9 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
     from pyiceberg.io.pyarrow import project_batches, schema_to_pyarrow
+    target_schema = schema_to_pyarrow(self.projection())

Review Comment:
I'm less versed in Parquet-to-Arrow buffer conversion, so please do check me if this doesn't make sense 🙂 But are we actually casting the types on read? We decide whether to read with large or small types when instantiating the [fragment scanner](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1052C20-L1052C54), which loads the Parquet data into the Arrow buffers. The `schema_to_pyarrow()` calls for `pa.Table` or `pa.RecordBatchReader`, and in `to_requested_schema` after that, all represent the table schema in the same consistent (large or small) form, so they shouldn't result in any additional casting or reassignment of buffers.

I think the only time we cast the types is on write, where we may want to downcast them for forward compatibility. It looks like we have to choose a schema on write anyway, because giving the ParquetWriter a schema that isn't consistent with the dataframe's schema results in an [exception](https://github.com/apache/iceberg-python/pull/902/files#r1669524329).
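For illustration, here is a rough, self-contained sketch of both points using plain PyArrow rather than our code paths (the file path, field name, and schemas are made up, and I'm assuming the dataset scanner will honour a `large_string` schema the same way our fragment scanner does):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

path = "/tmp/large_small_types_example.parquet"  # illustrative path

# Write side: the ParquetWriter schema must match the table schema exactly,
# so downcasting large_string -> string has to happen before write_table().
table = pa.table({"name": pa.array(["a", "b"], type=pa.large_string())})
file_schema = pa.schema([pa.field("name", pa.string())])
with pq.ParquetWriter(path, file_schema) as writer:
    # writer.write_table(table)  # would raise: table schema does not match
    writer.write_table(table.cast(file_schema))  # downcast first, then write

# Read side: the large-vs-small decision is made when the scanner is built;
# the data is loaded into buffers of the requested flavour, so a downstream
# target schema in the same flavour implies no further casting.
parquet_format = ds.ParquetFileFormat()
fragment = parquet_format.make_fragment(path, filesystem=fs.LocalFileSystem())
large_schema = pa.schema([pa.field("name", pa.large_string())])
scanner = ds.Scanner.from_fragment(fragment=fragment, schema=large_schema)
print(scanner.to_table().schema)  # expected: name: large_string
```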