syun64 commented on code in PR #921:
URL: https://github.com/apache/iceberg-python/pull/921#discussion_r1679682973
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1896,16 +1906,6 @@ def data_file_statistics_from_parquet_metadata(
         set the mode for column metrics collection
         parquet_column_mapping (Dict[str, int]): The mapping of the parquet file name to the field ID
     """
-    if parquet_metadata.num_columns != len(stats_columns):
-        raise ValueError(
-            f"Number of columns in statistics configuration ({len(stats_columns)}) is different from the number of columns in pyarrow table ({parquet_metadata.num_columns})"
-        )
-
-    if parquet_metadata.num_columns != len(parquet_column_mapping):
-        raise ValueError(
-            f"Number of columns in column mapping ({len(parquet_column_mapping)}) is different from the number of columns in pyarrow table ({parquet_metadata.num_columns})"
-        )
-

Review Comment:
   I've removed this check now that we have a comprehensive schema check in the write APIs. Removing these checks is necessary to allow `add_files` to add files whose schema is a subset of the Iceberg table's schema; with the checks in place, the new integration test fails:
   ```
   FAILED tests/integration/test_add_files.py::test_add_files_subset_of_schema[1] - ValueError: Number of columns in statistics configuration (4) is different from the number of columns in pyarrow table (3)
   ```
   Because we use the field IDs to aggregate into `stats_columns`, I think this is a safe change. We will leave the work of flagging column incompatibilities to the updated `_check_schema_compatible` function.
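   For context, here is a minimal, self-contained sketch of why a field-ID-keyed aggregation makes the strict column-count checks unnecessary. This is not the pyiceberg implementation; the function name `collect_stats_by_field_id` and the data shapes are illustrative only. The point is that the loop only visits columns actually present in the file, so a subset schema just produces fewer stats entries, and mismatches are left to `_check_schema_compatible`.
   ```python
   from typing import Any, Dict

   # Hypothetical sketch: stats aggregation keyed by Iceberg field ID.
   # `column_mapping` maps parquet column names to field IDs;
   # `file_columns` holds the per-column stats actually present in the file.
   def collect_stats_by_field_id(
       column_mapping: Dict[str, int],
       file_columns: Dict[str, Dict[str, Any]],
   ) -> Dict[int, Dict[str, Any]]:
       stats: Dict[int, Dict[str, Any]] = {}
       # Iterate only over the columns that exist in the file; a file whose
       # schema is a subset of the table schema simply yields fewer entries,
       # so no "same number of columns" assertion is needed here.
       for column_name, column_stats in file_columns.items():
           field_id = column_mapping[column_name]
           stats[field_id] = column_stats
       return stats


   # A file carrying 2 of the table's 3 columns still aggregates cleanly.
   mapping = {"id": 1, "name": 2, "ts": 3}
   subset_file = {
       "id": {"min": 1, "max": 10},
       "name": {"min": "a", "max": "z"},
   }
   print(collect_stats_by_field_id(mapping, subset_file))
   # {1: {'min': 1, 'max': 10}, 2: {'min': 'a', 'max': 'z'}}
   ```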