jcklie-amazon opened a new issue, #47201:
URL: https://github.com/apache/arrow/issues/47201
### Describe the bug, including details regarding any error messages, version, and platform.
I have a use case where I need to process files that were originally written by PySpark, do the processing with pyarrow, and write them back so that they can again be read by PySpark. During the processing, I need to extend the schema.
When I extend the schema and write the table with `flavor="spark"`, reading with PySpark does not show the added columns. Using other Parquet tools like pandas, polars, or random online viewers, the data is there.
My guess after some debugging is that PySpark does not use the Parquet schema itself if there is key-value metadata named `org.apache.spark.sql.parquet.row.metadata`, which pyarrow does not update. When I call `replace_schema_metadata()` before writing the table in pyarrow, PySpark can see the new column. I would expect writing with pyarrow using the spark flavor to handle this.
Code to reproduce:
`pip install pyarrow pyspark`
```python
from pyspark.sql import SparkSession
import pyarrow as pa
import pyarrow.parquet as pq

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    ["id", "label"],  # add your column names here
)
df.write.mode("overwrite").parquet("table.parquet")

table = pq.read_table("table.parquet")

# Rebuild the rows with an extra derived "value" column.
result = []
for batch in table.to_batches():
    d = batch.to_pydict()
    for id_, label in zip(d["id"], d["label"]):
        result.append({"id": id_, "label": label, "value": label * 2})

schema = table.schema.append(pa.field("value", pa.string()))
enriched_table = pa.Table.from_pylist(result, schema=schema)
pq.write_table(enriched_table, "enriched_table.parquet", flavor="spark")

print("# Pyarrow reading enriched")
print()
enriched_table2 = pq.read_table("enriched_table.parquet")
print(enriched_table2.schema)
print(enriched_table2.schema.metadata)

print("# Pyspark reading enriched")
print()
df = spark.read.parquet("enriched_table.parquet")
df.printSchema()
print(df.select("value"))
```
Output:
```
# Pyarrow reading enriched
id: int64
label: string
value: string
-- schema metadata --
org.apache.spark.version: '3.5.5'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 120
{b'org.apache.spark.version': b'3.5.5', b'org.apache.spark.sql.parquet.row.metadata': b'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"label","type":"string","nullable":true,"metadata":{}}]}'}
# Pyspark reading enriched
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
Traceback (most recent call last):
pyspark.errors.exceptions.captured.AnalysisException:
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name
`value` cannot be resolved. Did you mean one of the following? [`id`, `label`].;
'Project ['value]
+- Relation [id#6L,label#7] parquet
Process finished with exit code 1
```
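For reference, here is a minimal sketch of the workaround mentioned above: dropping the stale Spark key from the schema metadata via `replace_schema_metadata()` before writing, so PySpark falls back to the actual Parquet schema and sees the new column. This assumes removing the key is sufficient; regenerating the JSON for the extended schema would be the cleaner fix, and is what I would expect `flavor="spark"` to do.

```python
# Workaround sketch: strip the outdated Spark row metadata before writing,
# so PySpark reads the Parquet schema instead of the stale JSON.
metadata = {
    k: v
    for k, v in (enriched_table.schema.metadata or {}).items()
    if k != b"org.apache.spark.sql.parquet.row.metadata"
}
enriched_table = enriched_table.replace_schema_metadata(metadata)
pq.write_table(enriched_table, "enriched_table.parquet", flavor="spark")
```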
This use case might look a bit convoluted, but it makes sense in the full context of the app.
### Component(s)
Parquet, Python