qzyu999 opened a new issue, #50132:
URL: https://github.com/apache/arrow/issues/50132
### Describe the enhancement requested
### **Description:**
The C++ schema mapping between Parquet's `VARIANT` logical type and Arrow's
`VariantExtensionType` (`arrow.parquet.variant`) is established by GH-46104.
Once the C++ Variant encoder (GH-45947) and decoder (GH-45946) are in place,
the underlying C++ Parquet reader/writer will be able to process actual
Variant data payloads.
This issue tracks Python-level Parquet integration and testing to ensure
Python users can read and write Parquet files containing Variant columns
seamlessly.
#### **Proposed Changes:**
1. **Parquet Read/Write Pipeline Validation**:
    * Ensure that `pyarrow.parquet.write_table` correctly serializes
      `VariantExtensionType` columns into Parquet files with the `VARIANT`
      logical type annotation.
    * Ensure that `pyarrow.parquet.read_table` correctly deserializes
      Parquet `VARIANT` columns back into PyArrow `VariantArray` columns
      (rather than falling back to the raw binary-pair storage struct or
      throwing an unsupported type exception).
2. **Metadata and Schema Inspection**:
    * Verify that `pyarrow.parquet.read_schema` and `ParquetFile.schema`
      correctly report the column type as the `VariantExtensionType`
      extension type.
3. **Integration Testing**:
    * Add end-to-end tests in
      `python/pyarrow/tests/parquet/test_data_types.py` (or a dedicated
      integration test suite):
        * Construct a `pyarrow.Table` containing a `VariantArray` (e.g.,
          from nested dictionaries/lists).
        * Write it to a file using `pyarrow.parquet.write_table`.
        * Read the file back using `pyarrow.parquet.read_table` and assert
          that the types and values are identical to the original table.
        * Test reading a reference Parquet file containing Variant data
          written by a different implementation (e.g., Go or Spark) to
          verify cross-language compatibility.
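
The round-trip and schema checks above could take roughly the following shape. This is only a sketch under stated assumptions: the Python Variant bindings do not exist yet (GH-50131), so it uses a plain struct of two binary fields as a stand-in for the Variant storage. The field names `metadata` and `value`, the column name `v`, the sample bytes, and the helper names `make_storage_table` and `round_trip` are all illustrative and not part of any existing API. How a real `VariantArray` is constructed depends on GH-50131 and is not shown.

```python
try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:  # keep the sketch importable where pyarrow is absent
    pa = pq = None

# Illustrative payloads only. They are assumed to follow the Parquet Variant
# encoding (empty metadata dictionary, an int8 value, a null value), but
# nothing in this sketch decodes or validates them.
EMPTY_METADATA = b"\x01\x00\x00"
SAMPLE_VALUES = [b"\x0c\x2a", b"\x00"]


def make_storage_table():
    """Build a table whose column mimics the raw binary-pair storage struct."""
    # Assumption: the storage is struct<metadata: binary, value: binary>.
    storage_type = pa.struct([
        pa.field("metadata", pa.binary()),
        pa.field("value", pa.binary()),
    ])
    rows = [{"metadata": EMPTY_METADATA, "value": v} for v in SAMPLE_VALUES]
    return pa.table({"v": pa.array(rows, type=storage_type)})


def round_trip(table):
    """Write the table to an in-memory Parquet file and read it back.

    Returns the table read back, the Arrow schema from read_schema, and the
    Arrow schema exposed by ParquetFile.schema_arrow.
    """
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    buf = sink.getvalue()
    result = pq.read_table(pa.BufferReader(buf))
    schema = pq.read_schema(pa.BufferReader(buf))
    file_schema = pq.ParquetFile(pa.BufferReader(buf)).schema_arrow
    return result, schema, file_schema


if pa is not None:
    original = make_storage_table()
    result, schema, file_schema = round_trip(original)
    # Types and values should survive the round trip unchanged.
    assert result.schema == original.schema
    assert result.equals(original)
    assert schema.field("v").type == original.schema.field("v").type
    assert file_schema.field("v").type == original.schema.field("v").type
```

Once the bindings are available, the same assertions would apply to a table built from a `VariantArray`, with the type check expecting the Variant extension type rather than the storage struct.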
#### **Dependencies:**
This issue is blocked by:
* **GH-50131**: [Python] Bindings for Variant canonical extension type
(Exposing the Python `VariantType`, `VariantArray`, and `VariantScalar` classes)
* **GH-45946** (PR #50121): [C++][Parquet] Variant decoding
* **GH-45947** (PR #50122): [C++][Parquet] Variant encoding
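
Until these dependencies land, tests written for this issue may need to skip themselves. A hedged sketch of a feature check follows. It assumes the classes named in GH-50131 end up as top-level attributes of the `pyarrow` module, which is not settled here, and the helper name `variant_bindings_available` is hypothetical.

```python
import importlib

# Class names taken from GH-50131; their final location is an assumption.
VARIANT_CLASS_NAMES = ("VariantType", "VariantArray", "VariantScalar")


def variant_bindings_available():
    """Return True only if pyarrow imports and exposes all Variant classes."""
    try:
        pa = importlib.import_module("pyarrow")
    except ImportError:
        return False
    return all(hasattr(pa, name) for name in VARIANT_CLASS_NAMES)
```

A test module could pass the result to a skip marker (for example `pytest.mark.skipif`) so that the Variant tests stay inactive until the bindings exist.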
### Component(s)
Python, Parquet