karuppayya commented on PR #13084:
URL: https://github.com/apache/iceberg/pull/13084#issuecomment-2956798998

   > `orphanFiles().mapPartitions(DeleteOrphanFilesSparkAction.distributedDeleteFunction).collect`
   
   
   https://iceberg.apache.org/docs/latest/spark-procedures/#remove_orphan_files
   `equal_schemes`, `equivalent_schemes`, `prefix_mismatch_mode`
   These validations happen as part of the action (based on the accumulator state).
   These action options would become no-ops with the dataframe generated from the action.
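   For illustration, here is a hedged sketch (plain Python, not Iceberg's actual code) of what an `equal_schemes`-style normalization does before actual and metadata file paths are compared; the default mapping below is an assumption for the example:

   ```python
   from urllib.parse import urlparse

   # Hypothetical equal-schemes map: treat s3a/s3n paths as equal to s3 paths
   # when deciding whether a listed file matches a metadata file.
   DEFAULT_EQUAL_SCHEMES = {"s3a": "s3", "s3n": "s3"}

   def normalize_scheme(uri: str, equal_schemes=DEFAULT_EQUAL_SCHEMES) -> str:
       """Rewrite the URI scheme using the equal-schemes map, so that
       s3a://bucket/f and s3://bucket/f compare equal."""
       parsed = urlparse(uri)
       scheme = equal_schemes.get(parsed.scheme, parsed.scheme)
       return parsed._replace(scheme=scheme).geturl()
   ```

   The point of the concern above is that this kind of validation is driven by accumulator state collected while the action runs, so it cannot be re-applied to a dataframe handed back to the caller.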
   
   > we are hard-coding `coalesce(10)`, etc.
   
   This is to control parallelism and thereby throttle deletes.
   I have a task on me to make this configurable.
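   The idea behind the fixed `coalesce(10)` can be sketched in plain Python (a hedged illustration, not the PR's code): cap how many concurrent delete streams hit the storage system, with the cap as a parameter instead of a constant:

   ```python
   from concurrent.futures import ThreadPoolExecutor

   def delete_files(paths, delete_fn, max_parallelism=10):
       """Delete `paths` with at most `max_parallelism` concurrent workers.
       `max_parallelism` plays the role of the hard-coded coalesce(10):
       it bounds the number of simultaneous delete calls (throttling)."""
       deleted = 0
       with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
           for _ in pool.map(delete_fn, paths):
               deleted += 1
       return deleted
   ```

   Making the bound configurable (as the task above proposes) lets users trade delete throughput against object-store request pressure.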
   
   > "Cache" and "count" pattern here would make this much more expensive for smaller remove orphan files counts
   
   Yes, you are right that most of these changes are just trying to match the old behavior, which I think we still need.
   We are not introducing new computation (the amount of intermediate data generated from the shuffle should be the same as before), but in addition we would be caching to disk.
   I don't think we would see a considerable runtime impact for a job with a smaller file count.
   (Note: the dataframe is not un-cached at the end of the action; I have a task for myself on the PR to fix that.)
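   The compute-once semantics behind the cache-and-count pattern can be shown with a small plain-Python sketch (the PR uses Spark's persist-to-disk on a DataFrame; this is only an analogy, and all names are made up):

   ```python
   class CachedResult:
       """Mimics DataFrame.persist(): the expensive computation (here, the
       orphan-file listing) runs once, then both count() and the later
       distributed delete reuse the cached result."""

       def __init__(self, compute):
           self._compute = compute
           self._value = None
           self.computations = 0  # how many times the expensive work ran

       def get(self):
           if self._value is None:
               self.computations += 1
               self._value = self._compute()
           return self._value

       def unpersist(self):
           # Mirror of DataFrame.unpersist(); the note above is about
           # making sure the action eventually calls this.
           self._value = None
   ```

   The extra cost for small orphan-file counts is just the disk write of a small cached result, which is why a large runtime impact seems unlikely.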


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

