HonahX commented on code in PR #667: URL: https://github.com/apache/iceberg-python/pull/667#discussion_r1650524201
########## pyiceberg/table/snapshots.py: ########## @@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool: class Snapshot(IcebergBaseModel): snapshot_id: int = Field(alias="snapshot-id") parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None) - sequence_number: Optional[int] = Field(alias="sequence-number", default=None) + # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import + sequence_number: Optional[int] = Field(alias="sequence-number", default=0) Review Comment: @kevinjqliu Thanks for spotting this! We definitely need to read snapshot.sequence_number as 0 for v1. However, as we have observed in the test outcome, making `sequence_number` default to 0 here leads to `sequence_number=0` be written to version 1 table metada's snapshots, which is not allowed by spec: ``` Writing v1 metadata: Snapshot field sequence-number should not be written ``` I think we may need a new [field_serializer](https://docs.pydantic.dev/latest/api/functional_serializers/#pydantic.functional_serializers.field_serializer) in `TableMetadataCommonFields` class or some other ways to correct the behavior on write. WDYT? ########## pyiceberg/table/__init__.py: ########## @@ -3827,6 +3827,40 @@ def _partition_summaries_to_rows( schema=manifest_schema, ) + def metadata_log_entries(self) -> "pa.Table": + import pyarrow as pa + + from pyiceberg.table.snapshots import MetadataLogEntry + + table_schema = pa.schema([ + pa.field("timestamp", pa.timestamp(unit="ms"), nullable=False), + pa.field("file", pa.string(), nullable=False), + pa.field("latest_snapshot_id", pa.int64(), nullable=True), + pa.field("latest_schema_id", pa.int32(), nullable=True), + pa.field("latest_sequence_number", pa.int64(), nullable=True), + ]) + + def metadata_log_entry_to_row(metadata_entry: MetadataLogEntry) -> Dict[str, Any]: + latest_snapshot = self.tbl.snapshot_as_of_timestamp(metadata_entry.timestamp_ms) + return { + "timestamp": metadata_entry.timestamp_ms, + "file": metadata_entry.metadata_file, + "latest_snapshot_id": latest_snapshot.snapshot_id if latest_snapshot else None, + "latest_schema_id": latest_snapshot.schema_id if latest_snapshot else None, + "latest_sequence_number": latest_snapshot.sequence_number if latest_snapshot else None, + } + + # imitates `addPreviousFile` from Java + # https://github.com/apache/iceberg/blob/8248663a2a1ffddd2664ea37b45882455466f71c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1450-L1451 + metadata_log_entries = self.tbl.metadata.metadata_log + [ + MetadataLogEntry(metadata_file=self.tbl.metadata_location, timestamp_ms=self.tbl.metadata.last_updated_ms) Review Comment: It seems this line acts more like https://github.com/apache/iceberg/blob/8a70fe0ff5f241aec8856f8091c77fdce35ad256/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L62-L66. Just curious the reason that you mention `addPreviousFile` here, which seem to be more relevant when we update the metadata_log during table commit. BTW, this reminds me that currently non-rest catalog does not update the metadata_log field during commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org