syun64 commented on code in PR #921:
URL: https://github.com/apache/iceberg-python/pull/921#discussion_r1679682973
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1896,16 +1906,6 @@ def data_file_statistics_from_parquet_metadata(
         set the mode for column metrics collection
         parquet_column_mapping (Dict[str, int]): The mapping of the parquet file name to the field ID
     """
-    if parquet_metadata.num_columns != len(stats_columns):
-        raise ValueError(
-            f"Number of columns in statistics configuration ({len(stats_columns)}) is different from the number of columns in pyarrow table ({parquet_metadata.num_columns})"
-        )
-
-    if parquet_metadata.num_columns != len(parquet_column_mapping):
-        raise ValueError(
-            f"Number of columns in column mapping ({len(parquet_column_mapping)}) is different from the number of columns in pyarrow table ({parquet_metadata.num_columns})"
-        )
-

Review Comment:
   I've removed this check now that we have a comprehensive schema check in the write APIs. Removing these checks is necessary to allow `add_files` to add files whose schema is a subset of the Iceberg table's schema; with the checks in place, the new integration test fails:
   ```
   FAILED tests/integration/test_add_files.py::test_add_files_subset_of_schema[1] - ValueError: Number of columns in statistics configuration (4) is different from the number of columns in pyarrow table (3)
   ```
   Because we use the field IDs to aggregate into `stats_columns`, I think this is a safe change. We will leave the work of flagging column incompatibilities to the updated `_check_schema_compatible` function.
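   For context, here is a minimal, self-contained sketch of why a field-ID-keyed aggregation makes the strict column-count checks unnecessary. This is not the pyiceberg implementation; the function name `collect_stats_by_field_id` and the data shapes are illustrative only. The point is that the loop only visits columns actually present in the file, so a subset schema just produces fewer stats entries, and mismatches are left to `_check_schema_compatible`.
   ```python
   from typing import Any, Dict

   # Hypothetical sketch: stats aggregation keyed by Iceberg field ID.
   # `column_mapping` maps parquet column names to field IDs;
   # `file_columns` holds the per-column stats actually present in the file.
   def collect_stats_by_field_id(
       column_mapping: Dict[str, int],
       file_columns: Dict[str, Dict[str, Any]],
   ) -> Dict[int, Dict[str, Any]]:
       stats: Dict[int, Dict[str, Any]] = {}
       # Iterate only over the columns that exist in the file; a file whose
       # schema is a subset of the table schema simply yields fewer entries,
       # so no "same number of columns" assertion is needed here.
       for column_name, column_stats in file_columns.items():
           field_id = column_mapping[column_name]
           stats[field_id] = column_stats
       return stats


   # A file carrying 2 of the table's 3 columns still aggregates cleanly.
   mapping = {"id": 1, "name": 2, "ts": 3}
   subset_file = {
       "id": {"min": 1, "max": 10},
       "name": {"min": "a", "max": "z"},
   }
   print(collect_stats_by_field_id(mapping, subset_file))
   # {1: {'min': 1, 'max': 10}, 2: {'min': 'a', 'max': 'z'}}
   ```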