ajantha-bhat commented on PR #9437: URL: https://github.com/apache/iceberg/pull/9437#issuecomment-1976956134
I did some benchmarking using `FileGenerationUtil` (the changes are included in the PR, see `TestPartitionStatsPerf`).

**It looks like the local algorithm is faster than the distributed one.**

```
case 1:
FileGenerationUtil.generateDataFile took 30 minutes to generate 10K partitions with 2 data file entries per partition.
1.4 seconds - local algorithm
3.3 seconds - distributed algorithm

case 2:
FileGenerationUtil.generateDataFile took 25 seconds to generate 20 partitions with 10K data file entries per partition.
1.7 seconds - local algorithm
4.1 seconds - distributed algorithm
```

Note: For case 1, I could increase the number of partitions further, but the generation then takes hours. I will try it overnight.