ajantha-bhat commented on PR #9437: URL: https://github.com/apache/iceberg/pull/9437#issuecomment-2064341275
@aokolnychyi: Please find the new PR that just works on local algorithm. https://github.com/apache/iceberg/pull/10176 > - We should focus on the local implementation for now. I think it is going to perform OK for most use cases and doing an efficient distributed implementation would be fairly hard. Even if we come up with that, the cost of transferring the results back to the driver may overweight everything else. Let's focus on the local implementation. - If we stay local, we may skip the action and provide PartitionStatsGenerator or something similar in core. - It is possible that some of the snapshots will be expired by the time we compute partition stats. Therefore, we will not be able to determine the last snapshot that modified some partitions. It is OK but the algorithm should account for that. - The partition summaries should be sorted before they are written out (as required by the spec). Done > - The snapshot ID is random and there may be clock skew that affects the commit timestamp. We should be relying on snapshot ordinals like in CDC scans to determine the snapshot order. This is not applicable for the current logic right? As I am using `table.currentSnapshot()`. This may be needed when we introduce incremental update in the future. > - Different partitions may have the same unified partition tuple but it does not make them the same. For instance, I may have a spec1 with p1=a and a spec2 with p1=a/p2=null. Their unified partition tuples are the same but we cannot squash them into one summary entry. Instead, we should persist them separately with different spec IDs. This means the algorithm should be adjusted. Keep in mind that PartitionMap is not thread-safe and cannot be used globally. I didn't get this point, the unified tuple design is similar to partitions metadata table. Also, if we consider the users of partition stats, they finally want to know the stats for the filter query where `p1=a/p2=null`, so it should be for unified tuple as query doesn't applicable only to new spec (instead whole data)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org