singhpk234 commented on code in PR #13459:
URL: https://github.com/apache/iceberg/pull/13459#discussion_r2183779315
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##########
@@ -312,22 +316,24 @@ private String rebuildMetadata() {
   }
 
   private String saveFileList(Set<Pair<String, String>> filesToMove) {
-    List<Tuple2<String, String>> fileList =
-        filesToMove.stream()
-            .map(p -> Tuple2.apply(p.first(), p.second()))
-            .collect(Collectors.toList());
-    Dataset<Tuple2<String, String>> fileListDataset =
-        spark().createDataset(fileList, Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
     String fileListPath = stagingDir + RESULT_LOCATION;
-    fileListDataset
-        .repartition(1)
-        .write()
-        .mode(SaveMode.Overwrite)
-        .format("csv")
-        .save(fileListPath);
+    OutputFile fileList = table.io().newOutputFile(fileListPath);

Review Comment:
   [doubt] The staging location defaults to the table's metadata path, but it can be set to anything, right? If that's the case:
   1. What if the table's FileIO doesn't have credentials to write to the staging directory, but Spark does? Would this now cause failures?
   2. What if the staging directory is on local disk while the table's FileIO points to an object store? Would those workloads now fail?
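   For context on the question, here is a minimal sketch of what the FileIO-based write could look like. Only the method signature, `stagingDir + RESULT_LOCATION`, and `table.io().newOutputFile(fileListPath)` come from the diff above; the stream handling, the "source,target" line format, and the error handling are assumptions, not the PR's actual code:

   import java.io.IOException;
   import java.io.OutputStreamWriter;
   import java.io.UncheckedIOException;
   import java.io.Writer;
   import java.nio.charset.StandardCharsets;
   import java.util.Set;
   import org.apache.iceberg.io.OutputFile;
   import org.apache.iceberg.io.PositionOutputStream;
   import org.apache.iceberg.util.Pair;

   // Sketch of a method that would live in RewriteTablePathSparkAction; the PR's
   // actual body is truncated in the diff above, so details here are illustrative.
   private String saveFileList(Set<Pair<String, String>> filesToMove) {
     String fileListPath = stagingDir + RESULT_LOCATION;
     // The write goes through the table's FileIO, so that FileIO must be able to
     // reach stagingDir (credentials, filesystem scheme), independent of what
     // Spark itself can write to.
     OutputFile fileList = table.io().newOutputFile(fileListPath);
     try (PositionOutputStream out = fileList.createOrOverwrite();
         Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
       for (Pair<String, String> pair : filesToMove) {
         // one "sourcePath,targetPath" line per file to move (format assumed)
         writer.write(pair.first() + "," + pair.second() + "\n");
       }
     } catch (IOException e) {
       throw new UncheckedIOException("Failed to write file list to " + fileListPath, e);
     }
     return fileListPath;
   }

   The trade-off behind the two questions: the previous Spark Dataset CSV write used Spark's own Hadoop configuration and credentials, while a FileIO-based write depends entirely on what the table's FileIO is configured to reach, so a staging directory outside that scope could behave differently.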