Fokko commented on code in PR #1743: URL: https://github.com/apache/iceberg-python/pull/1743#discussion_r1979125483
##########
pyiceberg/io/pyarrow.py:
##########
@@ -2475,18 +2484,25 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        if check_schema:
+            _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
             stats_columns=compute_statistics_plan(schema, table_metadata.properties),
             parquet_column_mapping=parquet_path_to_id_mapping(schema),
+            check_schema=check_schema,
         )
+        if partition_deductor is None:
+            partition = statistics.partition(table_metadata.spec(), table_metadata.schema())
+        else:
+            partition = partition_deductor(file_path)

Review Comment:
   While you can add keys to the `Record`, values are looked up by position, based on the schema that belongs to it (in this case, the one of the active `PartitionSpec`).

##########
pyiceberg/io/pyarrow.py:
##########
@@ -2475,18 +2484,25 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        if check_schema:
+            _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())

Review Comment:
   At Iceberg, we're pretty concerned with making sure that everything is compatible at write time. Instead of skipping the check, we could change `_check_pyarrow_schema_compatible` to allow for additional columns in the Parquet file. It is okay to skip `optional` columns but not `required` ones.
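The relaxed check suggested in the second review comment could be sketched roughly as follows. This is only an illustration of the rule (extra file columns are tolerated, missing `optional` columns are tolerated, missing `required` columns are rejected); it does not use PyIceberg's real `_check_pyarrow_schema_compatible`. The names `Field` and `check_compatible` and the plain string type names are hypothetical, and type promotion and nested fields are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a table schema field (not a PyIceberg class)."""

    name: str
    type: str
    required: bool = False


def check_compatible(table_fields, file_columns):
    """Raise ValueError if a file cannot be added under the relaxed rule.

    table_fields: list of Field describing the table schema.
    file_columns: dict mapping column name -> type name found in the file.
    """
    problems = []
    for field in table_fields:
        if field.name not in file_columns:
            # A missing optional column can be read back as null;
            # a missing required column cannot.
            if field.required:
                problems.append(f"missing required column: {field.name}")
            continue
        if file_columns[field.name] != field.type:
            problems.append(
                f"type mismatch for {field.name}: "
                f"table has {field.type}, file has {file_columns[field.name]}"
            )
    # Columns present only in the file are deliberately not inspected.
    if problems:
        raise ValueError("; ".join(problems))
```

With a table of `[Field("id", "long", required=True), Field("note", "string")]`, a file containing `id` plus an unknown `extra` column passes, while a file containing only `note` raises `ValueError` because the required `id` column is absent.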
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@iceberg.apache.org