amogh-jahagirdar commented on code in PR #11982:
URL: https://github.com/apache/iceberg/pull/11982#discussion_r1927584179


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##########
@@ -728,4 +710,13 @@ private String getMetadataLocation(Table tbl) {
        !metadataDir.isEmpty(), "Failed to get the metadata file root directory");
     return metadataDir;
   }
+
+  @VisibleForTesting
+  Broadcast<Table> tableBroadcast() {
+    if (tblBroadcast == null) {

Review Comment:
   I checked out the code locally; see my comment above. I think we can just remove the `tableBroadcast` argument from these helper functions. It's all local state anyway, so we can call `tableBroadcast()` in the places that need it rather than passing it through as an argument. That'll solve the naming issue.
   
   On the testing aspect, my take is that we really just need separate tests for `SerializableTable` in Spark (to make sure it works with things like Kryo). Once that component has coverage confirming that fields like partition specs serialize correctly, actions that depend on it don't need to explicitly test broadcast/serialization cases; it can safely be assumed. Although I recognize the test in this case was necessary to demonstrate the issue, since we had a separate broadcast for the specs.
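   
   For example, the kind of standalone coverage I mean would look roughly like this (a sketch; `KryoHelpers.roundTripSerialize` stands in for whatever Kryo round-trip test utility we use, and is an assumption here, not something from this PR):
   
   ```java
   // Sketch: round-trip a SerializableTable copy through Kryo and check that
   // the fields actions rely on (specs, schema) survive serialization.
   // KryoHelpers.roundTripSerialize is assumed test tooling, not part of this PR.
   @Test
   public void testSerializableTableKryoRoundTrip() throws IOException {
     Table serializableTable = SerializableTable.copyOf(table);
     Table roundTrip = KryoHelpers.roundTripSerialize(serializableTable);

     assertThat(roundTrip.specs()).isEqualTo(table.specs());
     assertThat(roundTrip.schema().asStruct()).isEqualTo(table.schema().asStruct());
   }
   ```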
   
   Again though, I'm good to just keep the test as is for now.



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##########
@@ -572,21 +564,12 @@ private void rewritePositionDeletes(TableMetadata metadata, Set<DeleteFile> toRe
     Dataset<DeleteFile> deleteFileDs =
        spark().createDataset(Lists.newArrayList(toRewrite), deleteFileEncoder);
 
-    Broadcast<Table> serializableTable = sparkContext().broadcast(SerializableTable.copyOf(table));
-    Broadcast<Map<Integer, PartitionSpec>> specsById =
-        sparkContext().broadcast(metadata.specsById());
-
     PositionDeleteReaderWriter posDeleteReaderWriter = new SparkPositionDeleteReaderWriter();
     deleteFileDs
         .repartition(toRewrite.size())
         .foreach(
             rewritePositionDelete(
-                serializableTable,
-                specsById,
-                sourcePrefix,
-                targetPrefix,
-                stagingDir,
-                posDeleteReaderWriter));
+                tableBroadcast(), sourcePrefix, targetPrefix, stagingDir, posDeleteReaderWriter));

Review Comment:
   `rewritePositionDelete` is a private helper function, so can we just call `tableBroadcast()` inside it rather than taking it as an argument? That will also help address the private field masking issue.
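   
   Roughly what I'm suggesting (a sketch: the parameter list matches the call site above, the return type is inferred from the `foreach` usage, and the body is abbreviated for illustration):
   
   ```java
   // Sketch: resolve the broadcast inside the helper instead of taking it as
   // a parameter; capturing it in a local keeps the returned lambda from
   // closing over the enclosing action when Spark serializes it.
   private ForeachFunction<DeleteFile> rewritePositionDelete(
       String sourcePrefix,
       String targetPrefix,
       String stagingDir,
       PositionDeleteReaderWriter posDeleteReaderWriter) {
     Broadcast<Table> broadcast = tableBroadcast();
     return deleteFile -> {
       Table table = broadcast.value();
       // ... existing rewrite logic using table, the prefixes, and posDeleteReaderWriter
     };
   }
   ```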



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
