Guosmilesmile commented on PR #13998: URL: https://github.com/apache/iceberg/pull/13998#issuecomment-3285626214
> Thanks for the PR @Guosmilesmile! This functionality comes with some additional complexity. It changes the file discovery logic to be concurrent, which could create non-deterministic behavior. I'm wondering if you could describe the motivation behind this change? I'm guessing that the single-threaded approach was too slow in your HDFS setup. Could you share any numbers?

Hi @mxm, yes. In my scenario, with multi-level partitioning such as hour(time), bucket(30, a), and bucket(30, b), the number of folders becomes very large (see the sketch below). When the data retention period is particularly long, a single list operation can take around 10 minutes, and the longer the retention period, the longer a task run takes.

If you think the parallel approach adds too much complexity, would it make sense to default `usePrefixListing` to `true`, so that the non-parallel path is provided by default and users can switch to the parallel one themselves?
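For illustration, here is a minimal sketch of a partition spec along the lines described above, built with Iceberg's `PartitionSpec` builder. The schema and field names (`time`, `a`, `b`) are placeholders chosen to match the example, not the actual table from this PR:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class PartitionFanoutSketch {
  public static void main(String[] args) {
    // Placeholder schema matching the partitioning described above.
    Schema schema = new Schema(
        Types.NestedField.required(1, "time", Types.TimestampType.withZone()),
        Types.NestedField.required(2, "a", Types.StringType.get()),
        Types.NestedField.required(3, "b", Types.StringType.get()));

    // hour(time) x bucket(30, a) x bucket(30, b): each retained hour can fan out
    // into up to 30 * 30 = 900 directories, so a long retention period quickly
    // produces a very large number of folders for a recursive listing to visit.
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .hour("time")
        .bucket("a", 30)
        .bucket("b", 30)
        .build();

    System.out.println(spec);
  }
}
```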
