Re: [I] [feat] `add_files` support parquet files with field ids [iceberg-python]

via GitHub Tue, 15 Oct 2024 06:20:47 -0700


MrDerecho commented on issue #1227:
URL: 
https://github.com/apache/iceberg-python/issues/1227#issuecomment-2413900767


   For context: right now I manage a very large data lake of time-partitioned
   data.  The use case has to do with the archival process put in place,
   wherein after a rolling period of time these files are “deleted” from the
   table and copied (before snapshot expiry) into an archive prefix for later
   use (if needed), subject to a lifecycle policy, i.e. Glacier and then
   physical deletion after some time.
   
   Because I use Trino to optimize, I have a mix of pyiceberg batch files and
   Trino/Spark-generated files that may need to be “re-added” at a
   later date.  Let me know if you have any other questions.  Thanks for the
   help.
   
   On Tue, Oct 15, 2024 at 9:07 AM Sung Yun ***@***.***> wrote:
   
   > Thanks for raising this @MrDerecho <https://github.com/MrDerecho> - in
   > the initial version of add_files, we wanted to limit it to just parquet
   > files that were created in an external system. The assumption is that
   > unless the files are created by an Iceberg client and are cognizant of the
   > Iceberg schema, there would be no way for the parquet writing process to
   > use the correct field IDs in the produced parquet schema.
   >
   > Currently, if I am using pyiceberg to create/maintain my iceberg tables
   > and I use Trino (AWS Athena) to do compaction on the same (using Spark)-
   > the files created via compaction are unable to be "re-added" using the
   > add_files method at a later time.
   >
   > This sounds like a really cool use case, but I'd like to understand it
   > better - why isn't the application (Trino/Spark) that is doing the
   > compaction committing the compacted files into Iceberg itself?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@iceberg.apache.org
