stevenzwu commented on PR #7161: URL: https://github.com/apache/iceberg/pull/7161#issuecomment-1762592732
@chenwyi2 Is your point that we shouldn't consider only the bucketing column (as this PR does), and that you just want a plain keyBy in this case? That would be a fair point. Do you get balanced traffic distribution among the write tasks with a simple keyBy?

I am also wondering whether a partition spec of dt, hour, minute, and bucket(id) is the best option, especially the minute column as a partition. Do you really need minute-level partition granularity? You are creating very fine-grained partitions. Even with the most optimal data distribution/shuffle, there are still a lot of partitions and data files. You used 8 as the bucket number, which seems quite small for bucketing. What's the reason for using 8 buckets?
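To put rough numbers on the granularity concern, here is a minimal sketch of the partition-count arithmetic. It assumes every (hour, minute, bucket) combination receives at least one record per day, so the result is an upper bound on partitions and, since each written partition needs at least one data file, a lower bound on daily files in that case. The helper `partitions_per_day` is hypothetical and is not part of the Iceberg or Flink API.

```python
# Hypothetical helper for illustration only; not an Iceberg or Flink API.
def partitions_per_day(num_buckets: int, minute_level: bool) -> int:
    """Upper bound on distinct partitions created for one dt value.

    Assumes every (hour, minute, bucket) combination receives data.
    """
    hours = 24
    minutes = 60 if minute_level else 1  # drop the minute column -> 1 slot per hour
    return hours * minutes * num_buckets


# Spec from the discussion: dt, hour, minute, bucket(id) with 8 buckets.
with_minute = partitions_per_day(num_buckets=8, minute_level=True)

# Same spec without the minute column.
without_minute = partitions_per_day(num_buckets=8, minute_level=False)

print(with_minute, without_minute)
```

Under these assumptions, dropping the minute column cuts the daily partition count by a factor of 60, which leaves room to raise the bucket count without increasing the total number of partitions.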