kevinjqliu commented on issue #584: URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2041559077

> Further research shows that when I use [daft](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/iceberg.html#reading-a-table), I'm able to read and use the to_arrow() functionality just fine. This is interesting especially because daft utilizes pyiceberg.

The column name transformation behavior is part of the Java Iceberg spec when reading/writing Parquet files. Specifically, the transformed schema is pushed down to the Parquet reader/writer. I suspect this happens because the Java Parquet implementation supports both Avro and Parquet schemas (see [parquet cli](https://github.com/apache/parquet-mr/blob/db4183109d5b734ec5930d870cdae161e408ddba/parquet-cli/src/main/java/org/apache/parquet/cli/commands/SchemaCommand.java#L106-L111)), so the column name transformation is applied to stay compatible with both Parquet and Avro schemas.

From what I've seen, libraries in other languages do not do this, which means they can read and write Parquet files whose column names contain special characters. Daft uses the Rust Arrow library, which can read Parquet files with special characters in their column names, and pyarrow can read them as well. I checked the major Parquet libraries in Python, Rust, and Golang, and they all support reading special characters in Parquet column names.
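To illustrate the last point, here is a minimal sketch (not part of the original comment) showing pyarrow round-tripping a Parquet file whose column names contain special characters; the file name and column names are made up for illustration, and no name transformation is applied at any point:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table whose column names would be invalid Avro identifiers
# (they contain spaces, dots, and dashes).
table = pa.table({
    "user id": [1, 2, 3],
    "total.amount": [10.5, 20.0, 7.25],
    "first-name": ["a", "b", "c"],
})

# pyarrow writes and reads the column names verbatim.
pq.write_table(table, "special_columns.parquet")
round_tripped = pq.read_table("special_columns.parquet")
print(round_tripped.column_names)  # ['user id', 'total.amount', 'first-name']
```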
> Further research shows that when I use [daft](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/iceberg.html#reading-a-table) that I'm able to read and use the to_arrow() functionality just fine. This is interesting especially because daft utilizes pyiceberg. The column name transformation behavior is part of the Java Iceberg spec when reading/writing parquet files. Specifically, the transformed schema is pushed down to parquet reader/writer. I suspect this is happening since the Java parquet implementation supports both Avro and parquet schema (See [parquet cli](https://github.com/apache/parquet-mr/blob/db4183109d5b734ec5930d870cdae161e408ddba/parquet-cli/src/main/java/org/apache/parquet/cli/commands/SchemaCommand.java#L106-L111)). So to be compatible with both parquet and Avro schemas, this column name transformation behavior is used. From what I've seen, libraries in other languages do not do this. This means these libraries can read/write parquet files having special characters in their column names. Daft uses the Rust Arrow library which can read parquet files with special characters in their column names. Similarly, pyarrow can read it as well. I checked major parquet libraries in Python, Rust, Golang and they can all support reading special characters in parquet column names. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org