stevenzwu commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765429213

   @rdblue here is a recap of the discussion on PR #7161:
https://github.com/apache/iceberg/pull/7161#issuecomment-1761169778
   
   PR #7161 automatically applies the custom bucketing partitioner to distribute 
buckets to writer tasks in a balanced way. [It only looks at the bucket 
column](https://github.com/apache/iceberg/blob/main/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java#L528-L531)
 (ignoring other partition columns), on the assumption that the bucket column 
is the main thing we need to distribute. 
   
   But a user reports that they have a partition spec like date, hour, minute, 
bucket(8). PR #7161 imposed a new default behavior that changed the 
distribution from a simple keyBy on tuples of all partition columns to a custom 
partitioner that uses only the bucket column. To me, that partition strategy is 
questionable: the bucket column is used here mainly to work around skewed data 
distribution across the partition columns and the unbalanced value distribution 
from a simple keyBy.
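
   To make the difference between the two routing keys concrete, here is a 
minimal sketch. It is a simplified model only: it is neither Flink's actual 
key-group assignment nor the partitioner from PR #7161, and the class and 
method names are hypothetical. It just shows that hashing the full partition 
tuple lets the time columns influence which writer receives a record, while a 
bucket-only key ignores them.

```java
import java.util.Objects;

// Hypothetical illustration; not Iceberg or Flink code.
class RoutingSketch {

    // Stand-in for a keyBy over all partition values:
    // the whole (date, hour, minute, bucket) tuple is hashed.
    static int routeByAllPartitionColumns(
            String date, int hour, int minute, int bucket, int parallelism) {
        return Math.floorMod(Objects.hash(date, hour, minute, bucket), parallelism);
    }

    // Stand-in for a partitioner that looks only at the bucket id.
    // A plain modulo is an assumption made for this sketch.
    static int routeByBucketOnly(int bucket, int parallelism) {
        return Math.floorMod(bucket, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4; // assumed number of writer subtasks
        int bucket = 3;      // one of the 8 buckets from bucket(8)
        for (int minute = 0; minute < 5; minute++) {
            int byTuple = routeByAllPartitionColumns(
                    "2023-10-16", 12, minute, bucket, parallelism);
            int byBucket = routeByBucketOnly(bucket, parallelism);
            System.out.println("minute=" + minute
                    + " tupleKey->writer " + byTuple
                    + ", bucketKey->writer " + byBucket);
        }
    }
}
```

   In this model the bucket-only writer index stays the same for every minute, 
whereas the tuple-based index may change from one minute to the next.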
   
   In the end, I feel it is safer to revert the behavior change from PR #7161 
and ask users to manually apply the custom partitioner for the bucket column. 
Previously, we were thinking about automatically enabling it when the partition 
spec has a bucketing column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

