Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

via GitHub Mon, 16 Oct 2023 16:38:12 -0700


stevenzwu commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765429213

@rdblue here is a recap of the discussion on PR #7161.
https://github.com/apache/iceberg/pull/7161#issuecomment-1761169778

PR #7161 automatically applies the custom bucketing partitioner to distribute
buckets to writer tasks in a balanced way. [It only looks at the bucket
column](https://github.com/apache/iceberg/blob/main/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java#L528-L531)
(ignoring other partition columns), on the assumption that the bucket column
is the main thing we need to distribute.

However, a user reported that they have a partition spec like `date, hour,
minute, bucket(8)`. PR #7161 imposed a new default behavior that changed the
distribution from a simple keyBy on tuples of all partition columns to a custom
partitioner that uses only the bucket column. To me, that partition strategy is
questionable: the bucket column is used here mainly to work around the skewed
data distribution across partition columns and the unbalanced value
distribution from a simple keyBy.
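
To make the difference concrete, here is a toy simulation of the two routing strategies for a spec shaped like `date, hour, minute, bucket(8)`. It is only a sketch under stated assumptions: `writer_by_full_key` and `writer_by_bucket_only` are hypothetical stand-ins, the CRC32 hash and the `bucket % parallelism` rule are simplifications, and nothing here reproduces Flink's actual key-group assignment or the partitioner added in PR #7161.

```python
import zlib
from collections import defaultdict
from itertools import product

NUM_BUCKETS = 8   # matches bucket(8) in the example spec
PARALLELISM = 4   # hypothetical number of writer subtasks


def writer_by_full_key(date, hour, minute, bucket, parallelism=PARALLELISM):
    """Toy stand-in for keyBy on the tuple of all partition columns."""
    key = f"{date}|{hour}|{minute}|{bucket}".encode()
    return zlib.crc32(key) % parallelism


def writer_by_bucket_only(date, hour, minute, bucket, parallelism=PARALLELISM):
    """Toy stand-in for a partitioner that looks only at the bucket id."""
    return bucket % parallelism


def partitions_per_writer(assign):
    """Map each writer index to the set of table partitions routed to it."""
    result = defaultdict(set)
    # 2 hours x 3 minutes x 8 buckets = 48 partitions for a single date.
    for hour, minute, bucket in product(range(2), range(3), range(NUM_BUCKETS)):
        partition = ("2023-10-16", hour, minute, bucket)
        result[assign(*partition)].add(partition)
    return dict(result)


full_key = partitions_per_writer(writer_by_full_key)
bucket_only = partitions_per_writer(writer_by_bucket_only)

for name, assignment in (("full key", full_key), ("bucket only", bucket_only)):
    sizes = {writer: len(parts) for writer, parts in sorted(assignment.items())}
    print(name, sizes)
```

In this toy setup the bucket-only routing gives every writer the same number of partitions, but each writer also receives data for every date/hour/minute combination. With the full-key routing, the balance depends entirely on how the hash happens to spread the tuples.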

In the end, I feel it is safer to revert the behavior change from PR #7161
and ask users to manually apply the custom partitioner for the bucket column.
Previously, we were thinking about automatically enabling it when the partition
spec has a bucketing column.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
