[GitHub] [iceberg] dramaticlly commented on a diff in pull request #7581: Add last updated timestamp and snapshotId for partition table

via GitHub Tue, 23 May 2023 09:37:15 -0700


dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1202686519



##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
         new Schema(
             Types.NestedField.required(1, "partition", Partitioning.partitionType(table)),
             Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+            Types.NestedField.required(
+                9,
+                "last_updated_at",
+                Types.TimestampType.withZone(),
+                "Partition last updated timestamp"),
+            Types.NestedField.required(
+                10,
+                "last_updated_snapshot_id",

Review Comment:
   > a) Is it a good idea to keep the snapshot id? Because regularly running
   > expire_snapshots can clean up the snapshots and we may not be able to map what
   > operation these files were created from, even with the snapshot id.
   >
   > b) There was also an ask for "latest sequence number" associated with that
   > partition from the community users during partition stats discussion.
   >
   > Do you think modified time is enough and no need for the sequence number?
   
   My initial thinking was that the last-updated timestamp is helpful by 
itself, but if there is any doubt about the timestamp, it is better to provide 
a reference that allows further investigation. Here the last-updated timestamp 
is derived from the snapshot, so providing the snapshotId gives a way to look 
up further information about that snapshot (for example, whether it was a 
rewrite-data operation or an append of late-arriving data).
   
   With respect to periodic snapshot expiration, I think a partition can 
reference a snapshotId whose snapshot has already expired, in which case the 
snapshot lookup returns null. But that seems to apply only when your data 
outlives your snapshots. That is, if you run data compaction alongside 
snapshot expiration, or if you also periodically delete partitions (for 
example, a daily-partitioned dataset with a retention period) together with 
snapshot expiration, it seems to be fine.
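
   To make the lookup and the expired-snapshot case concrete, here is a minimal 
sketch. It does not use the Iceberg API: `SnapshotInfo`, `describeLastUpdate`, 
and the in-memory map are hypothetical stand-ins, and the operation strings 
"append" and "replace" are assumed to match what a snapshot reports. Against a 
real table, the equivalent lookup would presumably go through 
`table.snapshot(id)` and `Snapshot.operation()`.

```java
import java.util.HashMap;
import java.util.Map;

public class LastUpdatedSnapshotSketch {

  // Hypothetical stand-in for the snapshot metadata a real table would expose.
  static final class SnapshotInfo {
    final long snapshotId;
    final String operation; // assumed values such as "append" or "replace"

    SnapshotInfo(long snapshotId, String operation) {
      this.snapshotId = snapshotId;
      this.operation = operation;
    }
  }

  // Explains what last touched a partition, given its last_updated_snapshot_id.
  // liveSnapshots holds only the snapshots that have not been expired yet.
  static String describeLastUpdate(
      Map<Long, SnapshotInfo> liveSnapshots, long lastUpdatedSnapshotId) {
    SnapshotInfo snapshot = liveSnapshots.get(lastUpdatedSnapshotId);
    if (snapshot == null) {
      // The referenced snapshot was expired: only last_updated_at is still usable.
      return "unknown (snapshot expired)";
    }
    switch (snapshot.operation) {
      case "replace":
        return "rewrite (e.g. compaction)";
      case "append":
        return "append (possibly late-arriving data)";
      default:
        return snapshot.operation;
    }
  }

  public static void main(String[] args) {
    Map<Long, SnapshotInfo> liveSnapshots = new HashMap<>();
    liveSnapshots.put(101L, new SnapshotInfo(101L, "append"));
    liveSnapshots.put(102L, new SnapshotInfo(102L, "replace"));

    System.out.println(describeLastUpdate(liveSnapshots, 101L));
    System.out.println(describeLastUpdate(liveSnapshots, 102L));
    // 42 stands for a snapshot id that is no longer present after expiration.
    System.out.println(describeLastUpdate(liveSnapshots, 42L));
  }
}
```

   The null branch is the case discussed above: the partition row still 
carries a timestamp, but the snapshotId can no longer be resolved to an 
operation.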
   
   As for the file sequence number, I think it might be helpful, but by itself 
it seems harder to use than the timestamp and snapshotId.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

