Re: [PR] Allow writing dataframes that are either a subset of table schema or in arbitrary order [iceberg-python]

via GitHub Fri, 12 Jul 2024 11:32:08 -0700


kevinjqliu commented on code in PR #829:
URL: https://github.com/apache/iceberg-python/pull/829#discussion_r1676323868



##########
pyiceberg/table/__init__.py:
##########
@@ -484,10 +484,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
         _check_schema_compatible(
             self._table.schema(), other_schema=df.schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us
         )
-        # cast if the two schemas are compatible but not equal

Review Comment:
   @syun64 I want to get your take on this part. Given the timestamp change, 
do you know whether `df` still needs to be cast? 
   Several parts of the write path are involved here. In particular, we need 
to look at the table schema, the df schema, and the df itself, as well as 
bin-packing and other transformations. 
   
   



##########
pyiceberg/table/__init__.py:
##########
@@ -484,10 +484,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
         _check_schema_compatible(
             self._table.schema(), other_schema=df.schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us
         )
-        # cast if the two schemas are compatible but not equal

Review Comment:
   Happy to move this conversation into an issue, which would also continue 
the discussion from 
https://github.com/apache/iceberg-python/pull/786#discussion_r1646417180



##########
pyiceberg/io/pyarrow.py:
##########
@@ -2053,7 +2055,10 @@ def _check_schema_compatible(table_schema: Schema, other_schema: pa.Schema, down
             f"PyArrow table contains more columns: {', '.join(sorted(additional_names))}. Update the schema first (hint, use union_by_name)."
         ) from e
 
-    if table_schema.as_struct() != task_schema.as_struct():
+    fields_missing_from_table = {field for field in other_schema.fields if field not in table_schema.fields}

Review Comment:
   This doesn't work for nested structs; we need a better solution.
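
   To illustrate the concern, here is a minimal sketch in plain Python. The 
`Field` dataclass and `leaf_paths` helper are hypothetical stand-ins, not 
PyIceberg or PyArrow APIs, and the leaf-path comparison is only one possible 
direction, not necessarily what this PR should adopt. It shows that a 
top-level membership check flags a whole struct as missing when the dataframe 
supplies only a subset of that struct's nested fields.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not a PyIceberg/PyArrow class)."""

    name: str
    type: str  # e.g. "long", "double", or "struct"
    children: Tuple["Field", ...] = ()  # populated only for struct fields


def leaf_paths(fields: Tuple[Field, ...], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, type) for every leaf field, descending into structs."""
    for f in fields:
        path = f"{prefix}{f.name}"
        if f.children:
            yield from leaf_paths(f.children, prefix=f"{path}.")
        else:
            yield path, f.type


table_fields = (
    Field("id", "long"),
    Field("location", "struct", (Field("lat", "double"), Field("lon", "double"))),
)

# The dataframe supplies only a subset of the nested struct's fields.
df_fields = (
    Field("id", "long"),
    Field("location", "struct", (Field("lat", "double"),)),
)

# Top-level membership: the whole `location` struct is reported as missing,
# because the two struct fields are not equal as whole objects.
top_level_missing = {f.name for f in df_fields if f not in table_fields}

# Leaf-path comparison: every leaf in the dataframe also exists in the table,
# so nothing is reported.
table_leaves = set(leaf_paths(table_fields))
leaf_missing = {
    path for path, typ in leaf_paths(df_fields) if (path, typ) not in table_leaves
}

print(top_level_missing)
print(leaf_missing)
```

   A real fix would also need to handle lists, maps, and field IDs, which this 
sketch ignores.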



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


