rahil-c commented on PR #7914:
URL: https://github.com/apache/iceberg/pull/7914#issuecomment-2208107487

   Hi all, sorry for the delay on this issue. I have been busy with internal 
work and did not get time to revisit it. 
   
   I originally hit this issue while working on a specific feature, the AWS 
LakeFormation and Iceberg integration, which is why I opened this PR. Several 
people appear to be hitting issues around the remove orphan files procedure, 
but I am not sure whether they are exactly the same issue I described in the 
overview. 
   
   Regarding the `No FileSystem for scheme "s3"` error: my understanding is 
that the remove orphan files procedure invokes the Hadoop file system, and 
Hadoop does not recognize the `s3` scheme out of the box, so reading an 
`s3://` path fails. 
https://github.com/apache/iceberg/blob/main/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java#L356
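
To illustrate the failure mode described above, here is a toy model of how a scheme-to-implementation lookup behaves. This is not Hadoop code: the function, exception, and dictionary names are hypothetical, and the real lookup also consults other registration mechanisms. It only shows why a configuration with no entry for `s3` fails while one that maps `s3` to the S3A class succeeds.

```python
# Toy model (NOT Hadoop code) of resolving a file system class from a URI scheme.
# All names here are hypothetical and exist only for illustration.
from urllib.parse import urlparse


class UnsupportedFileSystemError(Exception):
    """Raised when no implementation is registered for a scheme."""


def resolve_filesystem_class(path, conf):
    """Return the class name configured under fs.<scheme>.impl, or raise."""
    scheme = urlparse(path).scheme
    impl = conf.get(f"fs.{scheme}.impl")
    if impl is None:
        raise UnsupportedFileSystemError(f'No FileSystem for scheme "{scheme}"')
    return impl


# Assumed starting point: only the s3a scheme is mapped.
default_conf = {"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"}

# Mapping the s3 scheme to the same class lets s3:// paths resolve.
patched_conf = {**default_conf, "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"}
```

With `default_conf`, resolving an `s3://` path raises the error; with `patched_conf`, it returns the S3A class name.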
   
   The likely mitigation is to add the `hadoop-aws` jar and configure Spark 
with the appropriate Hadoop AWS settings. From the Iceberg AWS docs: 
https://github.com/apache/iceberg/blob/main/docs/docs/aws.md#hadoop-s3a-filesystem
   ```
   Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine.
   Configure AWS settings based on [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version, S3A configuration varies a lot based on the version you use).
   ```
   I think users can try adding 
   `"spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"` to 
their Spark configuration, as I saw a similar thread here: 
https://apache-iceberg.slack.com/archives/C03LG1D563F/p1656918500567629
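
As a rough sketch of the suggestion above, the setting could be passed on the `spark-submit` command line. The helper names below are hypothetical, and the `hadoop-aws` version is deliberately left as a parameter because it has to match the Hadoop version of your Spark build. The snippet only assembles the arguments; it does not launch Spark, and I have not verified it against a real cluster.

```python
# Sketch: assemble spark-submit arguments for the suggested mitigation.
# Helper names are hypothetical; the hadoop-aws version must be supplied by the user.


def s3_scheme_conf():
    """Spark conf mapping the s3 scheme to the S3A file system implementation."""
    return {"spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"}


def spark_submit_args(conf, hadoop_aws_version):
    """Render a --packages entry for hadoop-aws plus one --conf flag per setting."""
    args = ["--packages", f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # "<your-hadoop-version>" is a placeholder, not a recommended version.
    print(" ".join(spark_submit_args(s3_scheme_conf(), "<your-hadoop-version>")))
```

The `spark.hadoop.` prefix is how Spark forwards a property to the underlying Hadoop configuration, so the key Hadoop ends up seeing is `fs.s3.impl`.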
   
   As for landing this PR, I will see if I can add tests based on 
@RussellSpitzer's feedback.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

