aokolnychyi commented on PR #9437:
URL: https://github.com/apache/iceberg/pull/9437#issuecomment-2036115472

   I cloned this change and played with it locally. Here are my thoughts.
   
   1. We should focus on the local implementation for now. I think it is going 
to perform OK for most use cases, and an efficient distributed 
implementation would be fairly hard to build. Even if we came up with one, the 
cost of transferring the results back to the driver may outweigh everything 
else.
   2. If we stay local, we may skip the action and provide 
`PartitionStatsGenerator` or something similar in core.
   3. It is possible that some of the snapshots will be expired by the time we 
compute partition stats. Therefore, we will not be able to determine the last 
snapshot that modified some partitions. It is OK but the algorithm should 
account for that.
   4. The snapshot ID is random, and clock skew may affect the commit 
timestamp, so neither is a reliable ordering key. We should rely on snapshot 
ordinals, as CDC scans do, to determine the snapshot order.
   5. Different partitions may have the same unified partition tuple, but that 
does not make them the same. For instance, I may have a spec1 with `p1=a` and a 
spec2 with `p1=a/p2=null`. Their unified partition tuples are the same, but we 
cannot squash them into one summary entry. Instead, we should persist them 
separately with different spec IDs. This means the algorithm should be 
adjusted. Keep in mind that `PartitionMap` is not thread-safe and cannot be 
used globally.
   6. The partition summaries should be sorted before they are written out (as 
required by the spec).
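
   To make points 3-6 concrete, here is a rough local sketch in plain Java. It 
is not the code in this PR and it uses no Iceberg classes: `FileChange`, 
`PartitionSummary`, and `summarize` are hypothetical names, the partition is 
modeled as a plain string for the unified tuple, and the final sort key 
(partition, then spec ID) is my assumption about a reasonable order. Snapshot 
order comes from the position in the lineage (the ordinal), never from the ID 
or the timestamp, and snapshots missing from the lineage are treated as expired.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionStatsSketch {

  // Hypothetical stand-in for one added data file; not an Iceberg type.
  static final class FileChange {
    final int specId;
    final String partition; // unified partition tuple rendered as a string
    final long snapshotId;
    final long recordCount;

    FileChange(int specId, String partition, long snapshotId, long recordCount) {
      this.specId = specId;
      this.partition = partition;
      this.snapshotId = snapshotId;
      this.recordCount = recordCount;
    }
  }

  // Hypothetical summary entry, one per (spec ID, partition) pair.
  static final class PartitionSummary {
    final int specId;
    final String partition;
    long recordCount = 0L;
    Long lastUpdatedSnapshotId = null; // stays null if every snapshot involved has expired
    int lastOrdinal = -1;

    PartitionSummary(int specId, String partition) {
      this.specId = specId;
      this.partition = partition;
    }
  }

  // snapshotLineage lists the surviving snapshot IDs from oldest to newest.
  static List<PartitionSummary> summarize(List<FileChange> changes, List<Long> snapshotLineage) {
    // Ordinal = position in the lineage; IDs are random, so they say nothing about order.
    Map<Long, Integer> ordinals = new HashMap<>();
    for (int i = 0; i < snapshotLineage.size(); i++) {
      ordinals.put(snapshotLineage.get(i), i);
    }

    // The map is local to this call, so no thread-safety is required of it.
    Map<String, PartitionSummary> byKey = new HashMap<>();
    for (FileChange change : changes) {
      // The spec ID is part of the key: equal unified tuples from different specs stay separate.
      String key = change.specId + "|" + change.partition;
      PartitionSummary summary =
          byKey.computeIfAbsent(key, k -> new PartitionSummary(change.specId, change.partition));
      summary.recordCount += change.recordCount;

      Integer ordinal = ordinals.get(change.snapshotId);
      if (ordinal != null && ordinal > summary.lastOrdinal) {
        summary.lastOrdinal = ordinal;
        summary.lastUpdatedSnapshotId = change.snapshotId;
      }
      // ordinal == null: the snapshot has expired, so it cannot be the recorded last update.
    }

    // Sort before writing out (assumed order: partition, then spec ID).
    List<PartitionSummary> sorted = new ArrayList<>(byKey.values());
    sorted.sort(
        Comparator.comparing((PartitionSummary s) -> s.partition)
            .thenComparingInt(s -> s.specId));
    return sorted;
  }
}
```

   With a lineage of `[900, 50, 100]`, a partition touched by snapshots 900 and 
50 reports 50 as its last update even though 900 is the larger ID, and two 
entries with the same unified tuple but spec IDs 1 and 2 come out as two rows.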


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

