dramaticlly commented on code in PR #7581:
URL: https://github.com/apache/iceberg/pull/7581#discussion_r1202686519
##########
core/src/main/java/org/apache/iceberg/PartitionsTable.java:
##########
@@ -47,6 +47,16 @@ public class PartitionsTable extends BaseMetadataTable {
new Schema(
Types.NestedField.required(1, "partition",
Partitioning.partitionType(table)),
Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
+ Types.NestedField.required(
+ 9,
+ "last_updated_at",
+ Types.TimestampType.withZone(),
+ "Partition last updated timestamp"),
+ Types.NestedField.required(
+ 10,
+ "last_updated_snapshot_id",
Review Comment:
> a) Is it a good idea to keep the snapshot id? Because regularly running
> expire_snapshots can clean up the snapshots and we may not be able to map what
> operation these files were created from, even with the snapshot id.
>
> b) There was also an ask for "latest sequence number" associated with that
> partition from the community users during partition stats discussion.
>
> Do you think modified time is enough and no need for the sequence number?

My initial thinking is that the last updated timestamp is helpful by itself,
but if there is any doubt about the timestamp, it is better to provide a
reference that allows further investigation. Here we derive the last updated
timestamp from a snapshot, so providing the snapshotId gives a way to look up
more information about that snapshot (for example, whether it was a rewrite
data operation or an append of late-arriving data).
With respect to periodic snapshot expiration, I think the snapshotId
referenced by a partition can resolve to a null snapshot if that snapshot has
already expired, but this seems to apply only when your data outlives your
snapshots. That is, if you run data compaction alongside snapshot expiration,
or if you also periodically delete partitions (say the table is partitioned
daily and the dataset has a retention period) together with snapshot
expiration, it seems to be fine.
As for the file sequence number, I think it might be helpful, but by itself it
seems harder to use than the timestamp and snapshotId.
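
To make the snapshot lookup and the expired-snapshot case concrete, here is a
minimal sketch in plain Java. It is not the PR's code: `PartitionRow`,
`LastUpdateLookup`, and the map of live snapshot ids to operations are
hypothetical stand-ins so the example runs without Iceberg on the classpath.
Against a real table, I would expect the equivalent lookup to go through
`Table.snapshot(long)` (which returns null when no snapshot matches) and
`Snapshot.operation()`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for one row of the partitions metadata table.
final class PartitionRow {
  final String partition;
  final long lastUpdatedAtMillis;
  final long lastUpdatedSnapshotId;

  PartitionRow(String partition, long lastUpdatedAtMillis, long lastUpdatedSnapshotId) {
    this.partition = partition;
    this.lastUpdatedAtMillis = lastUpdatedAtMillis;
    this.lastUpdatedSnapshotId = lastUpdatedSnapshotId;
  }
}

// Hypothetical helper; liveSnapshotOperations models the snapshots that
// have not been expired yet, keyed by snapshot id.
final class LastUpdateLookup {
  private LastUpdateLookup() {}

  // Returns the operation ("append", "replace", ...) of the snapshot that
  // last updated the partition, or empty if that snapshot has expired.
  static Optional<String> lastOperation(
      PartitionRow row, Map<Long, String> liveSnapshotOperations) {
    return Optional.ofNullable(liveSnapshotOperations.get(row.lastUpdatedSnapshotId));
  }

  // The timestamp is always available; the operation is reported only
  // while the referenced snapshot still exists.
  static String describe(PartitionRow row, Map<Long, String> liveSnapshotOperations) {
    return lastOperation(row, liveSnapshotOperations)
        .map(op -> row.partition + " last updated by '" + op + "' at "
            + row.lastUpdatedAtMillis)
        .orElse(row.partition + " last updated at " + row.lastUpdatedAtMillis
            + " (snapshot " + row.lastUpdatedSnapshotId + " expired)");
  }
}
```

The point of the sketch is that an expired snapshot only degrades the answer:
the timestamp still stands on its own, and the snapshot id simply stops
resolving to an operation.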