karuppayya commented on PR #13084: URL: https://github.com/apache/iceberg/pull/13084#issuecomment-2956798998

> `orphanFiles().mapPartitions(DeleteOrphanFilesSparkAction.distributedDeleteFunction).collect`

Per https://iceberg.apache.org/docs/latest/spark-procedures/#remove_orphan_files, the action takes the options `equal_schemes`, `equivalent_schemes`, and `prefix_mismatch_mode`. The corresponding validations happen as part of the action (based on the accumulator state), so those options would become a no-op with the dataframe generated from the action.

> we are hard coding in coalesce 10, etc

This is to control parallelism and thereby throttle the delete requests. I have a task on me to make this configurable.

> "Cache" and "count" pattern here would make this much more expensive for smaller remove orphan files counts

Yes, you are right. Most of these changes are just trying to match the old behavior, which I think we still need. We are not introducing new computation (the amount of intermediate data generated from the shuffle should be the same as before), but in addition we would be caching it to disk. I don't think we would see a considerable runtime impact for a job with a smaller file count. (Note: the DataFrame is not un-cached at the end of the action; I have a task for myself on the PR to fix that. See the sketch below.)
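
For illustration, here is a minimal Scala sketch of the cache-to-disk / count / throttled-delete pattern described above. The names `deleteOrphanFiles`, `orphanFileDF`, `deleteFunc`, and `deleteParallelism` are hypothetical, not the PR's actual API; the sketch only shows `DISK_ONLY` caching materialized via `count()`, `coalesce` used to cap delete parallelism, and un-persisting when the action finishes.

```scala
import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch (not the PR's actual code) of the pattern discussed
// above: cache the orphan-file listing to disk so it is computed only once,
// cap delete parallelism with coalesce, and un-cache when the action ends.
def deleteOrphanFiles(
    orphanFileDF: Dataset[String],
    deleteFunc: Iterator[String] => Iterator[String],
    deleteParallelism: Int = 10): Seq[String] = {

  // DISK_ONLY keeps executor memory free; count() materializes the cache
  // eagerly so later validations and the delete pass reuse the same data
  // instead of recomputing the listing.
  val cached = orphanFileDF.persist(StorageLevel.DISK_ONLY)
  try {
    cached.count()

    // coalesce limits the number of concurrent delete tasks, throttling
    // requests against the underlying file system / object store.
    cached
      .coalesce(deleteParallelism)
      .mapPartitions(deleteFunc)(Encoders.STRING)
      .collect()
      .toSeq
  } finally {
    // Un-persist at the end of the action, addressing the note above about
    // the DataFrame currently not being un-cached.
    cached.unpersist()
  }
}
```

Releasing the cache in a `finally` block is one way to guarantee the un-cache happens even when a delete fails mid-action.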
> orphanFiles().mapPartitions(DeleteOprhanFilesSparkAction.distributedDeleteFunction).collect https://iceberg.apache.org/docs/latest/spark-procedures/#remove_orphan_files `equal_schemes`, `equivalent_schemes`, `prefix_mismatch_mode` The validations happens as part of the action(based on the accumulator state) These action's option would be become a no-op with the dataframe generated from the action. > we are hard coding in coalesce 10 , etc This is to control parallelism and there-in throttling. I have a task on me to make this configurable. > "Cache" and "count' pattern here would be make this much more expensive for smaller remove orphan files counts Yes, you are right that most of these changes are just trying to matching the the old behavior, which i think we still need. We are not introducing new computation(amount of intermediate data generated from shuffle should be same as before) but in addition we would be caching to disk. I dont think we would see a considerable runtime impact for a job with smaller file count. (Note: Dataframe is not un-cached at the the end of action for which i have a task for my self on the PR) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org