[PR] Spark: when doing rewrite_data_files, check for partitioning schema compatibility [iceberg]

via GitHub Wed, 26 Mar 2025 03:38:14 -0700


adrians opened a new pull request, #12651:
URL: https://github.com/apache/iceberg/pull/12651


   Context - the implementation as seen in the current release allows for this 
kind of scenario:
   
   * We start with a table of 20TB of data-files, divided in 200 coarse 
partitions (200 partitions X 200 parquet files X 512MB).
   * We want to do a partitioning schema evolution, to split every partition in 
10 smaller partitions.
   * Doing a first `rewrite_data_files`, we obtain 200 file-groups (assigned 
randomly from the parquet-files) x 100GB each. This is caused by a bit of logic 
that says "if those files do not match the latest partitioning schema, assume 
they're unpartitioned".
   * After the first `rewrite_data_files`, we're left with 2000 fine 
partitions, but since every file-group can (at least in theory) write to every 
partition, the expected result is something like 2000 partitions x 200 files x 
50MB.
   *  At this point, we need to compact those data-files, so we run a second 
`rewrite_data_files`.
   * After the second `rewrite_data_files`, we're finally left with 2000 
partitions x 20 files x 512MB.
   
   This Pull-request proposes an algorithm that simplifies the scenario:
   
   When building the file-groups for the first `rewrite_data_files`, check if 
the old partitioning schema is a coarser variant of the current schema. The 
scenario now becomes:
   
   * Doing a first `rewrite_data_files`, we obtain 200 file-groups x 100GB each 
(based on the old partitioning schema).
   * After the first `rewrite_data_files`, we're left with 2000 fine 
partitions, but since every fine-partition can be obtained from a single parent 
old-partition, the expected result is something like 2000 partitions x 20 files 
x 512MB.
   * The second pass is not necessary. (In practice, if the coarse-partitions 
are slightly larger than 100GB, they might be split into 2 file-groups, so 
there might be some small parquet-files to compact, but this task is orders of 
magnitude faster now).
   
   This is a significant improvement in terms of time taken to apply the new 
partitioning schema.
   
   The criteria to determine if the new partitioning is "finer or the same" 
than the old partitioning look something like this:
   * the new (finer) partitioning spec has more (or the same) number of fields 
than the old (coarse) one;
   AND
   * the first N source-columns for the new (finer) partitioning spec must be 
the same as the N source-columns of the old partitioning-spec (N = number of 
fields in the old partition-spec)
   AND
   * the first N fields of the new (finer) partitioning spec must have "finer" 
transformations than the N fields in the old spec (N = number of fields in the 
old partition-spec) - see table below
   
   | if `old.field[i].transformation` is | then `new.field[i].transformation` 
is the same or more specific |
   | ------------ | ---------------------------------------- |
   | identity     | identity                                 |
   | year         | year<br>month<br>day<br>hour<br>identity |
   | month        | month<br>day<br>hour<br>identity         |
   | day          | day<br>hour<br>identity                  |
   | hour         | hour<br>identity                         |
   | truncate(x)  | truncate(y) **AND** y≥x<br>identity      |
   | bucket(x)    | bucket(y) AND y≥x AND y%x=0<br>identity  |
   
   For the third bullet-point in the list of criteria, I have found that the 
`boolean Transform.satisfiesOrderOf(Transform a)` method that implements that 
predicate pretty well - except maybe for the `bucket` case, for which it'll 
fall back to the "unpartitioned" scenario.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[PR] Spark: when doing rewrite_data_files, check for partitioning schema compatibility [iceberg]

Reply via email to