HonahX commented on issue #584: URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2049204249
Sorry for being late here. Thanks everyone for the valuable discussion, PR and tests. I would like to add some insight on the read side. Regarding @Fokko 's comment: > If we truly read by Field-IDs the names should be irrelevant, so we should probably update our mapping to ensure we correctly project by IDs. I think we do correctly project by IDs. The real problem is the way that we sanitize the column names. In https://github.com/apache/iceberg-python/pull/83, we add sanitize [the file_schema](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L982C9-L982C126) in _task_to_table with the assumption that the column name of the parquet file follows the Avro Naming spec. However, I think the "sanitization" should be more general here: it should just ensure that the final `file_project_schema` contains the same column names of the parquet file's schema. The names in `file_schema` are different from the actual column names in the parquet file because we first try to load the file schema from the json string stored in the parquet file metadata: [link](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L982C9-L982C126) ```python file_schema = ( Schema.model_validate_json(schema_raw) if schema_raw is not None else pyarrow_to_schema(physical_schema, name_mapping) ) ``` Parquet files written by iceberg java contain this metadata json string. The json string represents the iceberg table schema at the time of writting the file. Therefore, it contains un-sanitized column names. Since we always need to run a visitor to sanitize/ensure column names match, how about we just get the `file_schema` directly from the pyarrow physical schema? ```python file_schema = pyarrow_to_schema(physical_schema, name_mapping) ``` This way, we can ensure that the column names match, and thus do not need to sanitize the column names later. I have verified that changing to this can fix both the sanitization issue in #83 and the issue here. Given that we want to align the writing behavior with the java implementation, we should also proceed #590. Appreciate any thoughts and suggestions on this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org