mikulskibartosz opened a new issue, #7457:
URL: https://github.com/apache/iceberg/issues/7457
### Apache Iceberg version
1.1.0
### Query engine
Athena
### Please describe the bug 🐞
It's not possible to read an Iceberg table with PyIceberg if the data was
written using PySpark and compacted with AWS Athena.
## Steps to reproduce
1. Create an Iceberg table:
```sql
CREATE TABLE IF NOT EXISTS table_name
(columns ...)
USING ICEBERG
PARTITIONED BY (date)
```
2. Write to the table using PySpark:
```python
spark_df = self.spark_session.createDataFrame(df)
spark_df.sort(date_column).writeTo(table_name).append()
```
3. Read the table using PyIceberg:
```python
from pyiceberg.catalog import load_glue
from pyiceberg.expressions import EqualTo

catalog = load_glue("default", {})
table = catalog.load_table('...')
scan = table.scan(
    row_filter=EqualTo("date", date_as_string),
)
result = scan.to_arrow()
```
The `result` variable contains correct data.
4. Compact the table's data files using the OPTIMIZE statement in AWS Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html):
```sql
OPTIMIZE table_name REWRITE DATA USING BIN_PACK WHERE date = 'date_as_string'
```
5. Optionally, VACUUM the table. This step does not change the behavior in any way.
6. Query the table using the same PyIceberg code as in step 3.
7. `to_arrow` raises an exception: `ValueError: Iceberg schema is not
embedded into the Parquet file, see
https://github.com/apache/iceberg/issues/6505`
8. The table can still be accessed correctly in AWS Athena.
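For context on the exception in step 7: it looks like PyIceberg raises it when a data file's Parquet footer does not contain an embedded Iceberg schema. Below is a minimal diagnostic sketch, assuming one of the Athena-rewritten data files has been copied to local disk and that `iceberg.schema` is the footer key the error message refers to (both assumptions on my part):

```python
import pyarrow.parquet as pq

# Inspect the footer key-value metadata of a compacted data file.
# The path is a placeholder for a file copied from the table's S3 location.
footer = pq.read_metadata("compacted-data-file.parquet")
key_value_metadata = footer.metadata or {}

# PyIceberg appears to look for an "iceberg.schema" entry here (assumption
# based on the error message and the linked issue #6505).
print(b"iceberg.schema" in key_value_metadata)
```

If this prints `False` for the files written by OPTIMIZE but `True` for the files written by PySpark, that would be consistent with the scan failing only after compaction.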
## Expected behavior
The query in step 6 should succeed and return the same results as the code in step 3, instead of raising the exception described in step 7.
## Dependency versions
### Writing data (step 2)
* pyarrow: 11.0.0
* pyspark: 3.3.1
* iceberg-spark-runtime-3.3_2.12-1.1.0.jar
### Reading data (steps 3 and 7)
* pyiceberg: 0.3.0
* pyarrow: 10.0.1