Guosmilesmile commented on PR #13998: URL: https://github.com/apache/iceberg/pull/13998#issuecomment-3285626214
> Thanks for the PR @Guosmilesmile! This functionality comes with some additional complexity. It changes the file discovery logic to be concurrent, which could create non-deterministic behavior. I'm wondering if you could describe the motivation behind this change? I'm guessing that the single-threaded approach was too slow in your HDFS setup. Could you share any numbers?

Hi @mxm, yes. In my scenario, with multi-level partitioning such as hour(time), bucket(30, a), and bucket(30, b), the number of folders becomes very large (see the sketch below). When the data retention period is particularly long, a single list operation can take around 10 minutes, and the longer the retention period, the longer a task run takes.

If you think the parallel approach adds too much complexity, would it make sense to default `usePrefixListing` to `true`, so that the non-parallel path is provided by default and users can switch to the parallel one themselves?
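For illustration, here is a minimal sketch of a partition spec along the lines described above, built with Iceberg's `PartitionSpec` builder. The schema and field names (`time`, `a`, `b`) are placeholders chosen to match the example, not the actual table from this PR:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class PartitionFanoutSketch {
  public static void main(String[] args) {
    // Placeholder schema matching the partitioning described above.
    Schema schema = new Schema(
        Types.NestedField.required(1, "time", Types.TimestampType.withZone()),
        Types.NestedField.required(2, "a", Types.StringType.get()),
        Types.NestedField.required(3, "b", Types.StringType.get()));

    // hour(time) x bucket(30, a) x bucket(30, b): each retained hour can fan out
    // into up to 30 * 30 = 900 directories, so a long retention period quickly
    // produces a very large number of folders for a recursive listing to visit.
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .hour("time")
        .bucket("a", 30)
        .bucket("b", 30)
        .build();

    System.out.println(spec);
  }
}
```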
