Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

via GitHub Thu, 11 Apr 2024 01:37:51 -0700


HonahX commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2049204249

Sorry for being late here. Thanks everyone for the valuable discussion, PR
and tests. I would like to add some insight on the read side. Regarding @Fokko
's comment:

> If we truly read by Field-IDs the names should be irrelevant, so we should
probably update our mapping to ensure we correctly project by IDs.

I think we do correctly project by IDs. The real problem is the way that we
sanitize the column names.
In https://github.com/apache/iceberg-python/pull/83, we add sanitize [the
file_schema](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L982C9-L982C126)
in _task_to_table with the assumption that the column name of the parquet
file follows the Avro Naming spec. However, I think the "sanitization" should
be more general here: it should just ensure that the final
`file_project_schema` contains the same column names of the parquet file's
schema.

The names in `file_schema` are different from the actual column names in the
parquet file because we first try to load the file schema from the json string
stored in the parquet file metadata:
[link](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L982C9-L982C126)
```python
file_schema = (
Schema.model_validate_json(schema_raw) if schema_raw is not None
else pyarrow_to_schema(physical_schema, name_mapping)
)
```
Parquet files written by iceberg java contain this metadata json string. The
json string represents the iceberg table schema at the time of writting the
file. Therefore, it contains un-sanitized column names.

Since we always need to run a visitor to sanitize/ensure column names match,
how about we just get the `file_schema` directly from the pyarrow physical
schema?
```python
file_schema = pyarrow_to_schema(physical_schema, name_mapping)
```
This way, we can ensure that the column names match, and thus do not need to
sanitize the column names later.

I have verified that changing to this can fix both the sanitization issue in
#83 and the issue here. Given that we want to align the writing behavior with
the java implementation, we should also proceed #590.

Appreciate any thoughts and suggestions on this!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

Reply via email to