Re: [PR] Flink: handle rescale properly for range bounds in sketch statistics [iceberg]

via GitHub Mon, 10 Jun 2024 14:03:23 -0700


stevenzwu commented on PR #10457:
URL: https://github.com/apache/iceberg/pull/10457#issuecomment-2159279193


   > > > I think we can overlay the hash distribution above the ranges,
   > > 
   > > 
   > > Not sure if we want that, hash distribution (keyBy) is simple and low 
overhead. Range distribution requires statistics collection.
   > 
   > If you partition your orders by time, and need to update the order if it 
was canceled, the your key/partition is not equally distributed, and hashing is 
probably not a good option.
   > 
   > I would like to see the range partitioning as a precursor for writing 
ordered files with Flink. If we use similar constructs as the Flink SQL 
ORDER_BY then we can order the rows before writing them out. If we want to do 
this for CDC streams then we need to send the records with the same id to the 
same subtask.
   > 
   > Again, not something immediate, but might worth revisiting later.
   
   I am happy to discuss it here. Agree that it is not sth we want to address 
in this PR.
   
   ORDER_BY is essentially the SortOrder defined in table properties. Note that 
currently Flink writer doesn't sort rows within a data file. Range partitioner 
only range split keys across files for better clustering.
   
   In the example of orders table partitioned by time (say hourly), the primary 
keys would be `(hour(ts), order_id)` tuple. hash distribution (keyBy) can work 
and ensure correctness. You are saying range distribution would work better, 
because the better clustering of `order_id` within an hourly partition, 
correct? With hash distribution, each subtask writes the number of data files 
equals to the number of hours in a checkpoint cycle. Range distribution can 
handles the event time skew (recent hours have more data than distant past). 
Yes, I agree with that assessment. There is some challenges with handling new 
values (like new hours as time advances). Right now, it is handled by 
round-robin, as there is no assumption of `SortOrder` should be handled as 
primary keys. Primary keys are better/safer handled as hash distribution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Flink: handle rescale properly for range bounds in sketch statistics [iceberg]

Reply via email to