pedorro opened a new issue, #11650:
URL: https://github.com/apache/iceberg/issues/11650

   ### Apache Iceberg version
   
   1.7.0 (latest release)
   
   ### Query engine
   
   Athena
   
   ### Please describe the bug 🐞
   
   When both of the following criteria are met, queries for a renamed column 
return nulls instead of the original values:
   
   1. The data is in an existing Parquet file that is 'appended' to the table 
(using the Java SDK `appendFile()` method)
   2. The existing Parquet file was created by something _other_ than Iceberg
   
   As a contrived example, assume a Parquet file with three columns and one 
row exists in S3. This Parquet file was written by something other than the 
Iceberg libraries (e.g. Apache Parquet lib v1.14.4). The file is appended to a 
new (empty) Iceberg table called `example_table`. Using the Java SDK:
   ```
   table.newAppend().appendFile(existingParquet).commit();
   ```
   
   Querying the table in AWS Athena now returns the single row:
   ```
   select * from example_table;
   
   | first | second | third |
   +-------+--------+-------+
   | aaa   | bbb    | ccc   |
   ```
   
   Now rename column `second` to `renamed`. Using the Java SDK:
   ```
   table.updateSchema().renameColumn("second", "renamed").commit();
   ```
   
   Querying the table _now_ returns the single row, but with no value in 
column `renamed`:
   ```
   select * from example_table;
   
   | first | renamed | third |
   +-------+---------+-------+
   | aaa   |         | ccc   |
   ```
   The result is the same when queried from both AWS Athena and Redshift 
Spectrum.
   
   Interestingly, if the existing Parquet file _was_ originally created by 
Iceberg (either via an `insert` query or using the Java SDK), this issue does 
not occur. In that case, a query for the renamed column _does_ return the 
correct (original) value. Even if the Iceberg-created Parquet file is copied or 
moved from its original location before being appended to a new table, the 
column rename works as expected. This suggests that Parquet files created by 
Iceberg have some unique (non-standard?) quality, and that the column rename 
operation relies on it.
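
One plausible explanation, offered as a sketch rather than a confirmed diagnosis: Iceberg tracks columns by field ID, and Parquet files written by Iceberg store those IDs in the file schema, while files from other writers usually do not. The Iceberg spec describes a name mapping (table property `schema.name-mapping.default`) for resolving columns in files that lack IDs. The toy model below is plain Java and does not use the Iceberg API; the class `RenameProjectionSketch`, its helpers, and the three-step resolution order are assumptions chosen to reproduce the symptoms reported above, not the actual Athena or Iceberg reader logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of column resolution; NOT Iceberg or Athena code.
final class RenameProjectionSketch {

    /** A column as stored in a data file; fieldId is null if the writer recorded none. */
    static final class FileColumn {
        final String name;
        final Integer fieldId;
        final String value;

        FileColumn(String name, Integer fieldId, String value) {
            this.name = name;
            this.fieldId = fieldId;
            this.value = value;
        }
    }

    /** Builds a table schema (field ID -> current column name) with IDs 1..n. */
    static Map<Integer, String> schema(String... names) {
        Map<Integer, String> schema = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            schema.put(i + 1, names[i]);
        }
        return schema;
    }

    /** The one-row file from the example, with or without stored field IDs. */
    static List<FileColumn> exampleFile(boolean withFieldIds) {
        String[] names = {"first", "second", "third"};
        String[] values = {"aaa", "bbb", "ccc"};
        List<FileColumn> file = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            file.add(new FileColumn(names[i], withFieldIds ? i + 1 : null, values[i]));
        }
        return file;
    }

    /**
     * Reads one row. nameMapping maps a file column name to a field ID and may be empty.
     * A table column that cannot be matched to any file column reads as null.
     */
    static Map<String, String> readRow(Map<Integer, String> tableSchema,
                                       List<FileColumn> file,
                                       Map<String, Integer> nameMapping) {
        // Reverse lookup, used only for the by-name fallback in step 3.
        Map<String, Integer> idByCurrentName = new HashMap<>();
        for (Map.Entry<Integer, String> e : tableSchema.entrySet()) {
            idByCurrentName.put(e.getValue(), e.getKey());
        }

        Map<Integer, String> valueById = new HashMap<>();
        for (FileColumn c : file) {
            Integer id = c.fieldId;                  // 1. field ID stored in the file
            if (id == null) {
                id = nameMapping.get(c.name);        // 2. name mapping, if one is supplied
            }
            if (id == null) {
                id = idByCurrentName.get(c.name);    // 3. match on the current column name
            }
            if (id != null) {
                valueById.put(id, c.value);
            }
        }

        Map<String, String> row = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : tableSchema.entrySet()) {
            row.put(e.getValue(), valueById.get(e.getKey()));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<Integer, String> renamed = schema("first", "renamed", "third");
        Map<String, Integer> noMapping = new HashMap<>();
        Map<String, Integer> mapping = new HashMap<>();
        mapping.put("first", 1);
        mapping.put("second", 2);   // the old name still points at field ID 2
        mapping.put("third", 3);

        System.out.println(readRow(renamed, exampleFile(false), noMapping)); // external file
        System.out.println(readRow(renamed, exampleFile(true), noMapping));  // file with IDs
        System.out.println(readRow(renamed, exampleFile(false), mapping));   // external + mapping
    }
}
```

In this model, the file without field IDs loses the value of `renamed` after the rename, the file with field IDs keeps it, and supplying a mapping from the old name `second` to the column's ID restores it. Whether Athena or Redshift Spectrum honor a name mapping on the table is not established here and would need to be verified.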
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

