Re: [PR] Spark 3.5: Spark action to compute the partition stats [iceberg]

via GitHub Thu, 18 Apr 2024 08:55:08 -0700


ajantha-bhat commented on PR #9437:
URL: https://github.com/apache/iceberg/pull/9437#issuecomment-2064341275


   @aokolnychyi: 
   
   Please find the new PR that just works on local algorithm. 
   https://github.com/apache/iceberg/pull/10176 
   
   
   >
   - We should focus on the local implementation for now. I think it is going 
to perform OK for most use cases and doing an efficient distributed 
implementation would be fairly hard. Even if we come up with that, the cost of 
transferring the results back to the driver may overweight everything else. 
Let's focus on the local implementation.
   - If we stay local, we may skip the action and provide 
PartitionStatsGenerator or something similar in core.
   - It is possible that some of the snapshots will be expired by the time we 
compute partition stats. Therefore, we will not be able to determine the last 
snapshot that modified some partitions. It is OK but the algorithm should 
account for that.
   - The partition summaries should be sorted before they are written out (as 
required by the spec).
   
   Done
   
   >
   - The snapshot ID is random and there may be clock skew that affects the 
commit timestamp. We should be relying on snapshot ordinals like in CDC scans 
to determine the snapshot order.
   
   This is not applicable for the current logic right? As I am using 
`table.currentSnapshot()`.  This may be needed when we introduce incremental 
update in the future. 
   
   >
   - Different partitions may have the same unified partition tuple but it does 
not make them the same. For instance, I may have a spec1 with p1=a and a spec2 
with p1=a/p2=null. Their unified partition tuples are the same but we cannot 
squash them into one summary entry. Instead, we should persist them separately 
with different spec IDs. This means the algorithm should be adjusted. Keep in 
mind that PartitionMap is not thread-safe and cannot be used globally.
   
   I didn't get this point, the unified tuple design is similar to partitions 
metadata table. Also, if we consider the users of partition stats, they finally 
want to know the stats for the filter query where `p1=a/p2=null`, so it should 
be for unified tuple as query doesn't applicable only to new spec (instead 
whole data)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark 3.5: Spark action to compute the partition stats [iceberg]

Reply via email to