Re: [PR] Spark: when doing rewrite_data_files, check for partitioning schema compatibility [iceberg]

via GitHub Sat, 05 Apr 2025 02:23:05 -0700


pvary commented on code in PR #12651:
URL: https://github.com/apache/iceberg/pull/12651#discussion_r2026871589



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##########
@@ -227,10 +227,11 @@ private StructLikeMap<List<FileScanTask>> groupByPartition(
 
     for (FileScanTask task : tasks) {
       // If a task uses an incompatible partition spec the data inside could contain values
-      // which belong to multiple partitions in the current spec. Treating all such files as
-      // un-partitioned and grouping them together helps to minimize new files made.
+      // which belong to multiple partitions in the current spec.
       StructLike taskPartition =
-          task.file().specId() == table.spec().specId() ? task.file().partition() : emptyStruct;
+          table.spec().equalOrFinerThan(table.specs().get(task.file().specId()))
+              ? task.file().partition()
+              : emptyStruct;

Review Comment:
   We are in the process of refactoring the compaction planning out into the
   core module. Please make sure that any changes here also land in
   `BinPackRewriteFilePlanner`:
   
https://github.com/apache/iceberg/blob/d5971429ea903be873b5884c64a3dd41076179ea/core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java#L279-L287
   
   FWIW, I have an open PR (#12692) that moves the Spark compaction to the new
   API and will remove the planning from here.
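
For readers following along, the grouping rule in the hunk above can be sketched in isolation. This is only an illustrative sketch, not Iceberg code: `FileInfo`, `UNPARTITIONED`, and the `specIsCompatible` predicate are hypothetical stand-ins (the predicate plays the role of the `equalOrFinerThan` check from the diff), and partition values are simplified to strings. Files whose spec passes the check keep their own partition as the grouping key; all others fall into one shared bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

public class PartitionGroupingSketch {

  // Hypothetical stand-in for a data file; this is not an Iceberg type.
  public static final class FileInfo {
    final String name;
    final int specId;       // id of the partition spec the file was written with
    final String partition; // partition value, simplified to a string

    public FileInfo(String name, int specId, String partition) {
      this.name = name;
      this.specId = specId;
      this.partition = partition;
    }
  }

  // Shared key for files whose partition value is not reused (the "empty struct" case).
  public static final String UNPARTITIONED = "<unpartitioned>";

  // Groups file names by partition. The predicate is an assumption standing in for
  // the compatibility check in the diff: it receives the file's spec id and returns
  // true when the file's partition value may be used as the grouping key.
  public static Map<String, List<String>> groupByPartition(
      List<FileInfo> files, IntPredicate specIsCompatible) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (FileInfo file : files) {
      String key = specIsCompatible.test(file.specId) ? file.partition : UNPARTITIONED;
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(file.name);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<FileInfo> files =
        Arrays.asList(
            new FileInfo("a", 1, "day=1"),
            new FileInfo("b", 2, "day=1"),
            new FileInfo("c", 0, "month=1"),
            new FileInfo("d", 1, "day=2"));
    // In this toy setup, spec 0 is treated as incompatible with the current spec.
    System.out.println(groupByPartition(files, specId -> specId != 0));
  }
}
```

Passing the check in as a predicate keeps the sketch independent of how compatibility is decided, which is the part that differs between the old spec-id comparison and the proposed check.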



