Kuinox opened a new issue, #48254:
URL: https://github.com/apache/arrow/issues/48254
### Describe the bug, including details regarding any error messages, version, and platform.
### Summary
UUID extension types are preserved in tables but dropped by
`pyarrow.parquet.read_schema`, creating an asymmetry between the table’s schema
and the schema read from Parquet metadata.
### Steps to Reproduce
```python
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
import tempfile
data = [
b'\xe4`\xf9p\x83QGN\xac\x7f\xa4g>\x4b\xa8\xcb',
b'\x1et\x14\x95\xee\xd5C\xea\x9b\xd7s\xdc\x91BK\xaf',
None,
]
table = pa.table([pa.array(data, type=pa.uuid())], names=["ext"])
print("table schema type:", table.schema.field("ext").type)  # extension<arrow.uuid>
path = Path(tempfile.gettempdir()) / "uuid_ext_test.parquet"
pq.write_table(table, path, store_schema=False)
print("read_schema type:", pq.read_schema(path).field("ext").type)
print("read_table schema type:",
      pq.read_table(path).schema.field("ext").type)
```
### Expected Behavior
`read_schema(path)` should yield the same type as the table schema (and
`read_table`), i.e., `extension<arrow.uuid>`.
### Actual Behavior
`read_schema(path)` returns `fixed_size_binary[16]`, while the original
`table.schema` and `read_table(path).schema` both report `extension<arrow.uuid>`.
Schema inspection through `read_schema` therefore drops the extension type.
### Notes
- Observed with the current pyarrow wheel (22.0.0) and current main
sources.
- `ParquetFile(...).schema_arrow` preserves the extension type; `read_schema`
does not.
### Component(s)
Python, Parquet