stevenzwu commented on PR #10457: URL: https://github.com/apache/iceberg/pull/10457#issuecomment-2159279193
> > > I think we can overlay the hash distribution above the ranges, > > > > > > Not sure if we want that, hash distribution (keyBy) is simple and low overhead. Range distribution requires statistics collection. > > If you partition your orders by time, and need to update the order if it was canceled, the your key/partition is not equally distributed, and hashing is probably not a good option. > > I would like to see the range partitioning as a precursor for writing ordered files with Flink. If we use similar constructs as the Flink SQL ORDER_BY then we can order the rows before writing them out. If we want to do this for CDC streams then we need to send the records with the same id to the same subtask. > > Again, not something immediate, but might worth revisiting later. I am happy to discuss it here. Agree that it is not sth we want to address in this PR. ORDER_BY is essentially the SortOrder defined in table properties. Note that currently Flink writer doesn't sort rows within a data file. Range partitioner only range split keys across files for better clustering. In the example of orders table partitioned by time (say hourly), the primary keys would be `(hour(ts), order_id)` tuple. hash distribution (keyBy) can work and ensure correctness. You are saying range distribution would work better, because the better clustering of `order_id` within an hourly partition, correct? With hash distribution, each subtask writes the number of data files equals to the number of hours in a checkpoint cycle. Range distribution can handles the event time skew (recent hours have more data than distant past). Yes, I agree with that assessment. There is some challenges with handling new values (like new hours as time advances). Right now, it is handled by round-robin, as there is no assumption of `SortOrder` should be handled as primary keys. Primary keys are better/safer handled as hash distribution. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org