MrDerecho commented on issue #1227: URL: https://github.com/apache/iceberg-python/issues/1227#issuecomment-2413900767
For context: I currently manage a very large data lake of time-partitioned data. The use case concerns our archival process: after a rolling period of time, files are “deleted” from the table and copied (before snapshot expiry) into an archive prefix for later use if needed. The archive is subject to a lifecycle policy, i.e. Glacier and then physical deletion after some time. Because I use Trino to optimize, I have a mix of pyiceberg batch files and Trino/Spark-generated files that may need to be “re-added” at a later date.

Let me know if you have any other questions. Thanks for the help.

On Tue, Oct 15, 2024 at 9:07 AM Sung Yun ***@***.***> wrote:

> Thanks for raising this @MrDerecho <https://github.com/MrDerecho> - in the initial version of add_files, we wanted to limit it to just parquet files that were created in an external system. The assumption is that unless the files are created by an Iceberg client and are cognizant of the Iceberg schema, there would be no way for the parquet writing process to use the correct field IDs in the produced parquet schema.
>
> Currently, if I am using pyiceberg to create/maintain my iceberg tables and I use Trino (AWS Athena) to do compaction on the same (using Spark)- the files created via compaction are unable to be "re-added" using the add_files method at a later time.
>
> This sounds like a really cool use case, but I'd like to understand it better - why isn't the application (Trino/Spark) that is doing the compaction committing the compacted files into Iceberg itself?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org