[PR] Support partitioning spec during data file rewrites in Spark. [iceberg]

via GitHub Mon, 21 Oct 2024 07:26:37 -0700


rdsarvar opened a new pull request, #11368:
URL: https://github.com/apache/iceberg/pull/11368


   # Description
   Currently, data file rewrites support specifying the output spec ID to be 
used. This PR adds the ability to provide a partition spec itself; if that spec 
does not already exist on the table, it is added as a non-default spec.
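
The "add if absent, without changing the default" behavior described above can be sketched with a toy model. This is only an illustration under stated assumptions: `SpecRegistry` and `addNonDefaultIfAbsent` are hypothetical names, specs are represented as plain strings, and none of this is Iceberg's actual API or the PR's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not an Iceberg class): specs indexed by ID plus a fixed default. */
final class SpecRegistry {
  // The list index doubles as the spec ID in this sketch.
  private final List<String> specs = new ArrayList<>();
  private final int defaultSpecId;

  SpecRegistry(String defaultSpec) {
    specs.add(defaultSpec);
    this.defaultSpecId = 0;
  }

  /** Returns the ID of an equal existing spec, or registers the spec as a new non-default one. */
  int addNonDefaultIfAbsent(String spec) {
    int existing = specs.indexOf(spec);
    if (existing >= 0) {
      return existing; // already known: reuse its ID
    }
    specs.add(spec); // new spec: the default is left untouched
    return specs.size() - 1;
  }

  int defaultSpecId() {
    return defaultSpecId;
  }

  int specCount() {
    return specs.size();
  }
}
```

In this model, the returned ID is what a rewrite would then use as its output spec ID, while new writes to the table keep using the default spec.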
   
   # Benefits
   These changes would make it simpler to tier partition granularity by time 
range. As an example: say your table is heavily used, with most queries 
targeting the most recent data, but you still want folks to be able to query 
further back in time. You could achieve additional performance improvements by 
using more granular partitions in the base table and then running a compaction 
job in tiers:
   
   1. Short-term compaction (reuses the table definition - high granularity, 
gets rid of as many small files as it can)
   2. Long-term compaction (uses a specified partition spec that is not the 
default - lower granularity, cuts down the metadata stored for the table)
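
As a rough sketch of how a compaction job might route data to one of the two tiers above, the helper below picks a tier from the age of a partition and maps it to a spec ID. Everything here is an assumption for illustration: `CompactionTiers`, the day-based cutoff, and the two spec ID parameters are hypothetical and do not come from the PR.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper (not part of Iceberg): chooses a compaction tier by data age. */
final class CompactionTiers {
  enum Tier { SHORT_TERM, LONG_TERM }

  private CompactionTiers() {}

  /** Partitions younger than shortTermDays stay in the fine-grained, short-term tier. */
  static Tier tierFor(LocalDate partitionDate, LocalDate today, long shortTermDays) {
    long ageDays = ChronoUnit.DAYS.between(partitionDate, today);
    return ageDays < shortTermDays ? Tier.SHORT_TERM : Tier.LONG_TERM;
  }

  /** Short-term rewrites keep the table's default spec; long-term rewrites use the coarser one. */
  static int outputSpecId(Tier tier, int defaultSpecId, int coarseSpecId) {
    return tier == Tier.SHORT_TERM ? defaultSpecId : coarseSpecId;
  }
}
```

The spec ID returned for a tier would then be supplied to the rewrite as its output spec, with the rewrite itself filtered to the matching time range.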
   
   # Notes for Reviewers
   
   **Note: This is definitely not complete and I am open to all feedback, 
including whether some of this functionality already exists elsewhere or 
whether it should be done differently.**
   
   The part I'm mostly iffy on is modifying `BaseUpdatePartitionSpec.java` with 
`table.updateSpec()` instead of having something like 
`table.addSpec(partitionSpec).addNonDefaultSpec().commit()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

