yegangy0718 opened a new pull request, #6382: URL: https://github.com/apache/iceberg/pull/6382
This PR is part of issue https://github.com/apache/iceberg/issues/6303 and project https://github.com/apache/iceberg/projects/27.

In this PR, we focus on bin packing based on traffic distribution statistics. This works well for data that is skewed on partition columns (such as event time). It requires calculating traffic distribution statistics across partition columns and using those statistics to guide shuffling decisions.

Changes:
1. Implement ShuffleOperator, which will be added before the Iceberg writer operator to collect the data distribution by key (generated from the provided KeySelector).
2. Implement ShuffleRecordWrapper, which contains either the record or the data distribution information.

I will have follow-up PRs to implement ShuffleCoordinator and the logic for sending and receiving data distribution between the coordinator and the operator.
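
To make the idea in the description concrete, here is a minimal sketch in plain Java with no Flink or Iceberg dependencies. It is not the code in this PR: `ShuffleSketch`, `RecordOrStatistics`, `KeyFrequencyCounter`, and `assign` are hypothetical names used only for illustration. The greedy "heaviest key to least-loaded subtask" assignment is one assumed bin-packing strategy, not necessarily the one the follow-up PRs will use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; these are not the classes added in the PR.
final class ShuffleSketch {

  /** Holds either a data record or a statistics snapshot, never both. */
  static final class RecordOrStatistics<T, K> {
    private final T record;
    private final Map<K, Long> statistics;

    private RecordOrStatistics(T record, Map<K, Long> statistics) {
      this.record = record;
      this.statistics = statistics;
    }

    static <T, K> RecordOrStatistics<T, K> fromRecord(T record) {
      return new RecordOrStatistics<>(record, null);
    }

    static <T, K> RecordOrStatistics<T, K> fromStatistics(Map<K, Long> statistics) {
      return new RecordOrStatistics<>(null, statistics);
    }

    boolean hasRecord() {
      return record != null;
    }

    T record() {
      return record;
    }

    Map<K, Long> statistics() {
      return statistics;
    }
  }

  /** Counts records per key, as an operator placed before the writer might. */
  static final class KeyFrequencyCounter<T, K> {
    private final Function<T, K> keySelector;
    private final Map<K, Long> counts = new HashMap<>();

    KeyFrequencyCounter(Function<T, K> keySelector) {
      this.keySelector = keySelector;
    }

    /** Updates the per-key count and forwards the record wrapped. */
    RecordOrStatistics<T, K> process(T record) {
      counts.merge(keySelector.apply(record), 1L, Long::sum);
      return RecordOrStatistics.fromRecord(record);
    }

    /** Wraps a copy of the counts collected so far. */
    RecordOrStatistics<T, K> snapshot() {
      return RecordOrStatistics.fromStatistics(new HashMap<>(counts));
    }
  }

  /** Greedy bin packing: heaviest key first, each to the least-loaded subtask. */
  static <K> Map<K, Integer> assign(Map<K, Long> counts, int numSubtasks) {
    List<Map.Entry<K, Long>> entries = new ArrayList<>(counts.entrySet());
    entries.sort(Map.Entry.<K, Long>comparingByValue().reversed());

    long[] load = new long[numSubtasks];
    Map<K, Integer> assignment = new HashMap<>();
    for (Map.Entry<K, Long> entry : entries) {
      int target = 0;
      for (int i = 1; i < numSubtasks; i++) {
        if (load[i] < load[target]) {
          target = i;
        }
      }
      load[target] += entry.getValue();
      assignment.put(entry.getKey(), target);
    }
    return assignment;
  }
}
```

In this sketch a single hot key ends up alone on one subtask while lighter keys are grouped on the others, which is the effect bin packing is meant to have on skewed partition columns.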