djchapm opened a new issue, #8767:
URL: https://github.com/apache/iceberg/issues/8767

   ### Feature Request / Improvement
   
   Hi, I'm writing this in an effort to improve the documentation. I spent a large amount of time writing Parquet-Avro files to S3 with the Glue catalog and Iceberg, but could never query the data using Athena. I thought it had to do with all the missing metadata on the Glue tables, but that was a red herring. The real problem was that writing data files does not automatically update table metadata. According to the API docs for Table.io():
   
   
![image](https://github.com/apache/iceberg/assets/9857153/2a61401d-b45c-42e4-9881-345311509479)
   
   This made me think that writing through an OutputFile obtained via Table.io() would update the table metadata. My usage:
   
   ```
   OutputFile outputFile = table.io().newOutputFile(location);
   appenderLocation.put(messageType, location);
   FileAppender<GenericRecord> appender = Parquet.write(outputFile)
           .forTable(table)
           .setAll(propsBuilder)
           .createWriterFunc(ParquetAvroWriter::buildWriter)
           .build();
   ```
   
   On closing the appender, the file is written but there are no updates to the table metadata. My table comes from GlueCatalog.loadTable(). I'm new to Iceberg, but I could not find it documented anywhere that you then have to look the file up again as an InputFile, create a transaction on the table, and commit it:
   
   ```
   log.info("Closing appender for message type {}", key);
   value.close(); // appender from above
   // one attempt, does nothing:
   //   tables.get(key).rewriteManifests();
   log.info("Committing {} file {}", key, appenderLocation.get(key));
   InputFile inputFile = tables.get(key).io().newInputFile(appenderLocation.get(key));
   DataFile dataFile = DataFiles.builder(tables.get(key).spec())
           .withInputFile(inputFile)
           .withMetrics(value.metrics())
           .withFormat(FileFormat.PARQUET)
           .build();
   Transaction t = tables.get(key).newTransaction();
   t.newAppend().appendFile(dataFile).commit();
   // commit all changes to the table
   t.commitTransaction();
   ```
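
   For what it's worth, it also looks like an explicit transaction isn't required for a single append; if I'm reading the API right, the same commit can be made directly on the table (this is just my assumption, not something I found documented):

   ```
   // assumed simpler form for a single append, without an explicit transaction
   tables.get(key).newAppend()
           .appendFile(dataFile)
           .commit();
   ```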
   
   So I would like to see improvements to the documentation and the AWS integration for writing Parquet data using GlueCatalog, or at least a test or example people could follow for writing files and updating the corresponding catalog metadata using public APIs (the JUnit tests do all kinds of metadata updates, but through protected APIs we cannot access).
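
   To make it concrete, here is a rough sketch of the kind of end-to-end example I was hoping to find. The warehouse path, database/table names, catalog properties, and the `records` iterable are all placeholders I made up, and it assumes an unpartitioned table:

   ```
   import java.util.Map;
   import java.util.UUID;

   import org.apache.avro.generic.GenericRecord;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.DataFiles;
   import org.apache.iceberg.FileFormat;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.aws.glue.GlueCatalog;
   import org.apache.iceberg.catalog.TableIdentifier;
   import org.apache.iceberg.io.FileAppender;
   import org.apache.iceberg.io.OutputFile;
   import org.apache.iceberg.parquet.Parquet;
   import org.apache.iceberg.parquet.ParquetAvroWriter;

   // 1. Load the table from the Glue catalog (placeholder warehouse/db/table names)
   GlueCatalog catalog = new GlueCatalog();
   catalog.initialize("glue", Map.of("warehouse", "s3://my-bucket/warehouse"));
   Table table = catalog.loadTable(TableIdentifier.of("my_db", "my_table"));

   // 2. Write a Parquet data file through the table's FileIO
   String location = table.locationProvider()
           .newDataLocation("data-" + UUID.randomUUID() + ".parquet");
   OutputFile outputFile = table.io().newOutputFile(location);
   FileAppender<GenericRecord> appender = Parquet.write(outputFile)
           .forTable(table)
           .createWriterFunc(ParquetAvroWriter::buildWriter)
           .build();
   appender.addAll(records);  // records: Iterable<GenericRecord>, produced elsewhere
   appender.close();          // metrics are only available after close

   // 3. Register the file in the table metadata -- this is the step that makes
   //    the data visible to Athena or any other engine reading the table
   DataFile dataFile = DataFiles.builder(table.spec())  // assumes an unpartitioned spec
           .withInputFile(table.io().newInputFile(location))
           .withMetrics(appender.metrics())
           .withFormat(FileFormat.PARQUET)
           .build();
   table.newAppend().appendFile(dataFile).commit();
   ```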
   
   Let me know your thoughts.
   
   
   ### Query engine
   
   Athena

