jcklie-amazon opened a new issue, #47201:
URL: https://github.com/apache/arrow/issues/47201
### Describe the bug, including details regarding any error messages, version, and platform.
I have a use case where I need to process files that were originally written by PySpark, do the processing with pyarrow, and write them back so that they can again be read by PySpark. During the processing, I need to extend the schema.
When I extend the schema and write the table with `flavor="spark"`, reading with PySpark does not show the added columns. Using other Parquet tools like pandas, polars, or random online viewers, the data is there.
My guess after some debugging is that PySpark does not use the Parquet schema itself if there is key-value metadata named `org.apache.spark.sql.parquet.row.metadata`, which pyarrow does not update. When I call `replace_schema_metadata()` before writing the table in pyarrow, PySpark can see the new column. I would expect writing with pyarrow using the spark flavor to handle this.
Code to reproduce:
`pip install pyarrow pyspark`
```python
from pyspark.sql import SparkSession
import pyarrow as pa
import pyarrow.parquet as pq

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    ["id", "label"],  # add your column names here
)
df.write.mode("overwrite").parquet("table.parquet")

table = pq.read_table("table.parquet")

# Rebuild the rows with an extra derived "value" column.
result = []
for batch in table.to_batches():
    d = batch.to_pydict()
    for id_, label in zip(d["id"], d["label"]):
        result.append({"id": id_, "label": label, "value": label * 2})

schema = table.schema.append(pa.field("value", pa.string()))
enriched_table = pa.Table.from_pylist(result, schema=schema)
pq.write_table(enriched_table, "enriched_table.parquet", flavor="spark")

print("# Pyarrow reading enriched")
print()
enriched_table2 = pq.read_table("enriched_table.parquet")
print(enriched_table2.schema)
print(enriched_table2.schema.metadata)

print("# Pyspark reading enriched")
print()
df = spark.read.parquet("enriched_table.parquet")
df.printSchema()
print(df.select("value"))
```
Output:
```
# Pyarrow reading enriched
id: int64
label: string
value: string
-- schema metadata --
org.apache.spark.version: '3.5.5'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 120
{b'org.apache.spark.version': b'3.5.5', b'org.apache.spark.sql.parquet.row.metadata': b'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"label","type":"string","nullable":true,"metadata":{}}]}'}
# Pyspark reading enriched
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
Traceback (most recent call last):
pyspark.errors.exceptions.captured.AnalysisException:
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name
`value` cannot be resolved. Did you mean one of the following? [`id`, `label`].;
'Project ['value]
+- Relation [id#6L,label#7] parquet
Process finished with exit code 1
```
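For reference, here is a minimal sketch of the workaround mentioned above: dropping the stale Spark key from the schema metadata via `replace_schema_metadata()` before writing, so PySpark falls back to the actual Parquet schema and sees the new column. This assumes removing the key is sufficient; regenerating the JSON for the extended schema would be the cleaner fix, and is what I would expect `flavor="spark"` to do.

```python
# Workaround sketch: strip the outdated Spark row metadata before writing,
# so PySpark reads the Parquet schema instead of the stale JSON.
metadata = {
    k: v
    for k, v in (enriched_table.schema.metadata or {}).items()
    if k != b"org.apache.spark.sql.parquet.row.metadata"
}
enriched_table = enriched_table.replace_schema_metadata(metadata)
pq.write_table(enriched_table, "enriched_table.parquet", flavor="spark")
```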
This use case might look a bit convoluted, but it makes sense in the full context of the app.
### Component(s)
Parquet, Python