aokolnychyi commented on PR #9437: URL: https://github.com/apache/iceberg/pull/9437#issuecomment-2036115472
I cloned this change and played with it locally. Here are my thoughts.

1. We should focus on the local implementation for now. I think it is going to perform OK for most use cases, and doing an efficient distributed implementation would be fairly hard. Even if we come up with one, the cost of transferring the results back to the driver may outweigh everything else. Let's focus on the local implementation.
2. If we stay local, we may skip the action and provide `PartitionStatsGenerator` or something similar in core.
3. It is possible that some of the snapshots will be expired by the time we compute partition stats. In that case, we will not be able to determine the last snapshot that modified some partitions. That is OK, but the algorithm should account for it.
4. The snapshot ID is random, and clock skew may affect the commit timestamp, so neither is a reliable way to order snapshots. We should rely on snapshot ordinals, like in CDC scans, to determine the snapshot order.
5. Different partitions may have the same unified partition tuple, but that does not make them the same. For instance, I may have a spec1 with `p1=a` and a spec2 with `p1=a/p2=null`. Their unified partition tuples are the same, but we cannot squash them into one summary entry. Instead, we should persist them separately with different spec IDs. This means the algorithm should be adjusted. Keep in mind that `PartitionMap` is not thread-safe and cannot be used globally.
6. The partition summaries should be sorted before they are written out (as required by the spec).
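To make points 3 through 6 concrete, here is a minimal single-threaded sketch in plain Java. It is not the PR's code and does not use Iceberg's classes: `PartitionStatsSketch`, `Key`, `Stats`, and `FileEntry` are hypothetical names, the snapshot lineage is assumed to be passed in as a list ordered oldest first (so the list index serves as the ordinal), and the sort order (tuple values with nulls first, then spec ID as a tie-breaker) is an assumption rather than a quote from the spec.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical, simplified model; none of these types are Iceberg classes.
final class PartitionStatsSketch {

  private static final Comparator<String> NULLS_FIRST =
      Comparator.nullsFirst(Comparator.<String>naturalOrder());

  // Point 5: the key is (spec ID, unified tuple), so equal tuples from
  // different specs are never squashed into one entry.
  static final class Key implements Comparable<Key> {
    final int specId;
    final List<String> tuple; // unified partition tuple; values may be null

    Key(int specId, String... tuple) {
      this.specId = specId;
      this.tuple = Arrays.asList(tuple);
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Key && ((Key) o).specId == specId && ((Key) o).tuple.equals(tuple);
    }

    @Override
    public int hashCode() {
      return Objects.hash(specId, tuple);
    }

    // Assumed order: tuple values (nulls first), then tuple length, then spec ID.
    @Override
    public int compareTo(Key other) {
      int n = Math.min(tuple.size(), other.tuple.size());
      for (int i = 0; i < n; i++) {
        int c = NULLS_FIRST.compare(tuple.get(i), other.tuple.get(i));
        if (c != 0) {
          return c;
        }
      }
      if (tuple.size() != other.tuple.size()) {
        return Integer.compare(tuple.size(), other.tuple.size());
      }
      return Integer.compare(specId, other.specId);
    }
  }

  static final class Stats {
    long fileCount;
    int lastOrdinal = -1;       // -1: no contributing snapshot is still in the lineage
    Long lastUpdatedSnapshotId; // stays null when every contributing snapshot has expired
  }

  static final class FileEntry {
    final Key key;
    final long snapshotId; // snapshot that added the file

    FileEntry(Key key, long snapshotId) {
      this.key = key;
      this.snapshotId = snapshotId;
    }
  }

  static TreeMap<Key, Stats> compute(List<Long> lineageOldestFirst, List<FileEntry> files) {
    // Point 4: order snapshots by position in the lineage, not by ID or timestamp.
    Map<Long, Integer> ordinals = new HashMap<>();
    for (int i = 0; i < lineageOldestFirst.size(); i++) {
      ordinals.put(lineageOldestFirst.get(i), i);
    }

    // Point 6: a TreeMap keeps entries sorted, ready to be written out in order.
    TreeMap<Key, Stats> result = new TreeMap<>();
    for (FileEntry file : files) {
      Stats stats = result.computeIfAbsent(file.key, k -> new Stats());
      stats.fileCount++;
      // Point 3: an expired snapshot has no ordinal; count the file but
      // leave the last-updated snapshot untouched.
      Integer ordinal = ordinals.get(file.snapshotId);
      if (ordinal != null && ordinal > stats.lastOrdinal) {
        stats.lastOrdinal = ordinal;
        stats.lastUpdatedSnapshotId = file.snapshotId;
      }
    }
    return result;
  }
}
```

With a lineage of `[900, 100, 500]`, a partition whose files were added by snapshots 900 and 100 reports 100 as its last-updated snapshot, because 100 sits later in the lineage even though its ID is numerically smaller. A partition whose only file came from a snapshot missing from the lineage keeps a null last-updated snapshot instead of failing.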