W-I-D-EE opened a new issue, #8339: URL: https://github.com/apache/iceberg/issues/8339
### Query engine

Spark 3.2.3

### Question

We have a scenario where we need to export data files to long-term tape storage, but still need the ability to re-add those data files if the data in question is needed again. Based on our current understanding of Iceberg, our procedure is as follows.

Exporting
1. Copy the data files and their partition folders to tape storage.
2. Execute the `deleteFile` API on each of the exported data files to remove them from the Iceberg table (sketch below).

Reloading
1. Copy the file structure back into the Iceberg table's data folder.
2. Execute `add_files` on the partition folders that were copied back in (second sketch below).

Based on what I described, does anyone see potential issues with this approach? Is there something better recommended?

One thing I'm concerned about is how `add_files` would handle partition/spec evolution. A simple example: say that when we exported the data we had a bucket size of 16, but a year later, when we go to re-import the data, the table spec now uses a bucket size of 32. Is this a problem? Would we need to essentially rewrite the archived data files to match the current spec?

This is probably an unorthodox setup, but my situation is isolated environments with limited data storage resources, so there is a need to move data around as it's needed and, more importantly, to make room for new data being generated. Appreciate any feedback.
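To make the export step concrete, here is a minimal sketch of removing already-archived files from the table's metadata with Iceberg's `DeleteFiles` API (`Table.newDelete()`). The `table` handle and the `archivedPaths` collection are assumptions for illustration; the paths would be the exact data file locations that were copied to tape.

```java
import org.apache.iceberg.DeleteFiles;
import org.apache.iceberg.Table;

public class ArchiveDataFiles {

  // Drops the archived data files from table metadata in a single commit.
  // `table` is assumed to be loaded from your catalog; `archivedPaths` are the
  // file paths that were already copied to tape (hypothetical input).
  static void dropArchivedFiles(Table table, Iterable<String> archivedPaths) {
    DeleteFiles delete = table.newDelete();
    for (String path : archivedPaths) {
      delete.deleteFile(path);  // removes the file from metadata; bytes on disk are untouched
    }
    delete.commit();            // one atomic commit covering all removed files
  }
}
```

Note this only unregisters the files; the physical copies still have to be cleaned up (or left to expire/orphan-file cleanup) once the tape copy is verified.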

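For the reload step, a hedged sketch of calling the `add_files` procedure from Spark, assuming the restored files are plain Parquet placed back under the table location. The catalog name, table name, and path below are placeholders, not values from the original report.

```java
import org.apache.spark.sql.SparkSession;

public class ReloadArchivedFiles {

  // Re-registers restored Parquet files with the Iceberg table via add_files.
  // `my_catalog`, `db.tbl`, and the restored path are hypothetical names.
  static void addFilesBack(SparkSession spark) {
    spark.sql(
        "CALL my_catalog.system.add_files(" +
        "  table => 'db.tbl'," +
        "  source_table => '`parquet`.`/warehouse/db/tbl/data/restored_partition`'" +
        ")");
  }
}
```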