NikitaMatskevich commented on code in PR #13459:
URL: https://github.com/apache/iceberg/pull/13459#discussion_r2183847443


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##########
@@ -312,22 +316,24 @@ private String rebuildMetadata() {
   }
 
   private String saveFileList(Set<Pair<String, String>> filesToMove) {
-    List<Tuple2<String, String>> fileList =
-        filesToMove.stream()
-            .map(p -> Tuple2.apply(p.first(), p.second()))
-            .collect(Collectors.toList());
-    Dataset<Tuple2<String, String>> fileListDataset =
-        spark().createDataset(fileList, Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
     String fileListPath = stagingDir + RESULT_LOCATION;
-    fileListDataset
-        .repartition(1)
-        .write()
-        .mode(SaveMode.Overwrite)
-        .format("csv")
-        .save(fileListPath);
+    OutputFile fileList = table.io().newOutputFile(fileListPath);

Review Comment:
   Hi, thank you for the review!
   1) As I checked after the review, this case is impossible: the action would fail earlier, because it already uses the table IO to copy the metadata, and the copied metadata files have to reside in the staging dir as well.
   2) Impossible for the same reason.
   
   You can find examples in the `rewriteVersionFile` method or similar ones.
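   For reference, here is a minimal sketch (not the exact code in this PR) of writing the source/target path pairs through the table `FileIO` instead of a Spark Dataset; the `FileListWriter` class and the exact `saveFileList` signature below are illustrative only:
```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.util.Pair;

class FileListWriter {
  // fileListPath would correspond to stagingDir + RESULT_LOCATION in the action
  static String saveFileList(FileIO io, String fileListPath, Set<Pair<String, String>> filesToMove) {
    OutputFile outputFile = io.newOutputFile(fileListPath);
    try (Writer writer =
        new OutputStreamWriter(outputFile.createOrOverwrite(), StandardCharsets.UTF_8)) {
      for (Pair<String, String> pair : filesToMove) {
        // keep the same "source,target" CSV-like layout the Spark write produced
        writer.write(pair.first() + "," + pair.second() + "\n");
      }
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to write file list to " + fileListPath, e);
    }
    return fileListPath;
  }
}
```
   The point of the change is that the file list goes through the same `FileIO` that already wrote the staged metadata copies, so no separate Spark write path (and no extra filesystem assumptions) is needed for the staging dir.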



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

