Re: [PR] Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath [iceberg]

via GitHub Tue, 08 Jul 2025 08:22:34 -0700


NikitaMatskevich commented on code in PR #13459:
URL: https://github.com/apache/iceberg/pull/13459#discussion_r2192822188



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##########
@@ -312,22 +313,25 @@ private String rebuildMetadata() {
   }
 
   private String saveFileList(Set<Pair<String, String>> filesToMove) {

Review Comment:
   Could you share your thoughts on the  nature of the test that you would like 
to see for this change? 
   
   If you want me to reproduce the real-life issue that we've had, then I will 
need to implement the setup in the iceberg-spark test module where native 
iceberg file io has enough permissions to the filesystem and hadoop doesn't. It 
should be something very close to the azure setup with managed identities that 
we've had, or maybe there are other complex edge-cases that I am not aware of. 
Anyway, do you think that such effort would be justified for this tiny PR ? It 
does not introduce any important changes. Usage of native table IO in this 
context cannot really be bug-prone by itself, because the modified class 
already uses exactly the same file io instance to run other steps of the action.
   
   If you are concerned with the test coverage of this method, in 
[TestRewriteTablePathsAction.java](https://github.com/apache/iceberg/blob/main/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java)
 there are many good tests, and every one of them passes through the step of 
saving a file list somewhere and reading the files from this file list to copy 
them afterwards. I can see that the file list is created allright on my machine 
and if I compare it to the spark-generated one, it has exactly the same 
structure and file contents.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath [iceberg]

Reply via email to