HonahX commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2049204249

   Sorry for being late here. Thanks everyone for the valuable discussion, PR 
and tests. I would like to add some insight on the read side. Regarding @Fokko 
's comment:
   
   > If we truly read by Field-IDs the names should be irrelevant, so we should 
probably update our mapping to ensure we correctly project by IDs.
   
   I think we do correctly project by IDs. The real problem is the way that we 
sanitize the column names.
   In https://github.com/apache/iceberg-python/pull/83, we add sanitize [the 
file_schema](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L982C9-L982C126)
 in _task_to_table with the assumption that the column name  of the parquet 
file follows the Avro Naming spec. However, I think the "sanitization" should 
be more general here: it should just ensure that the final 
`file_project_schema` contains the same column names of the parquet file's 
schema. 
   
   The names in `file_schema` are different from the actual column names in the 
parquet file because we first try to load the file schema from the json string 
stored in the parquet file metadata: 
[link](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L982C9-L982C126)
   ```python
   file_schema = (
               Schema.model_validate_json(schema_raw) if schema_raw is not None 
else pyarrow_to_schema(physical_schema, name_mapping)
           )
   ```
   Parquet files written by iceberg java contain this metadata json string. The 
json string represents the iceberg table schema at the time of writting the 
file. Therefore, it contains un-sanitized column names. 
   
   Since we always need to run a visitor to sanitize/ensure column names match, 
how about we just get the `file_schema` directly from the pyarrow physical 
schema?
   ```python
   file_schema = pyarrow_to_schema(physical_schema, name_mapping)
   ```
   This way, we can ensure that the column names match, and thus do not need to 
sanitize the column names later.
   
   I have verified that changing to this can fix both the sanitization issue in 
#83 and the issue here. Given that we want to align the writing behavior with 
the java implementation, we should also proceed #590.
   
   Appreciate any thoughts and suggestions on this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to