Re: [PR] Metadata Log Entries metadata table [iceberg-python]

via GitHub Mon, 24 Jun 2024 00:57:29 -0700


HonahX commented on code in PR #667:
URL: https://github.com/apache/iceberg-python/pull/667#discussion_r1650524201



##########
pyiceberg/table/snapshots.py:
##########
@@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool:
 class Snapshot(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", 
default=None)
-    sequence_number: Optional[int] = Field(alias="sequence-number", 
default=None)
+    # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import
+    sequence_number: Optional[int] = Field(alias="sequence-number", default=0)

Review Comment:
   @kevinjqliu Thanks for spotting this! We definitely need to read 
snapshot.sequence_number as 0 for v1. However, as we have observed in the test 
outcome, making `sequence_number` default to 0 here leads to 
`sequence_number=0` be written to version 1 table metada's snapshots, which is 
not allowed by spec:
   ```
   Writing v1 metadata:
   Snapshot field sequence-number should not be written
   ```
   
   I think we may need a new 
[field_serializer](https://docs.pydantic.dev/latest/api/functional_serializers/#pydantic.functional_serializers.field_serializer)
 in `TableMetadataCommonFields` class or some other ways to correct the 
behavior on write. WDYT?



##########
pyiceberg/table/__init__.py:
##########
@@ -3827,6 +3827,40 @@ def _partition_summaries_to_rows(
             schema=manifest_schema,
         )
 
+    def metadata_log_entries(self) -> "pa.Table":
+        import pyarrow as pa
+
+        from pyiceberg.table.snapshots import MetadataLogEntry
+
+        table_schema = pa.schema([
+            pa.field("timestamp", pa.timestamp(unit="ms"), nullable=False),
+            pa.field("file", pa.string(), nullable=False),
+            pa.field("latest_snapshot_id", pa.int64(), nullable=True),
+            pa.field("latest_schema_id", pa.int32(), nullable=True),
+            pa.field("latest_sequence_number", pa.int64(), nullable=True),
+        ])
+
+        def metadata_log_entry_to_row(metadata_entry: MetadataLogEntry) -> 
Dict[str, Any]:
+            latest_snapshot = 
self.tbl.snapshot_as_of_timestamp(metadata_entry.timestamp_ms)
+            return {
+                "timestamp": metadata_entry.timestamp_ms,
+                "file": metadata_entry.metadata_file,
+                "latest_snapshot_id": latest_snapshot.snapshot_id if 
latest_snapshot else None,
+                "latest_schema_id": latest_snapshot.schema_id if 
latest_snapshot else None,
+                "latest_sequence_number": latest_snapshot.sequence_number if 
latest_snapshot else None,
+            }
+
+        # imitates `addPreviousFile` from Java
+        # 
https://github.com/apache/iceberg/blob/8248663a2a1ffddd2664ea37b45882455466f71c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1450-L1451
+        metadata_log_entries = self.tbl.metadata.metadata_log + [
+            MetadataLogEntry(metadata_file=self.tbl.metadata_location, 
timestamp_ms=self.tbl.metadata.last_updated_ms)

Review Comment:
   It seems this line acts more like 
https://github.com/apache/iceberg/blob/8a70fe0ff5f241aec8856f8091c77fdce35ad256/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L62-L66.
   Just curious the reason that you mention `addPreviousFile` here, which seem 
to be more relevant when we update the metadata_log during table commit.
   
   BTW, this reminds me that currently non-rest catalog does not update the 
metadata_log field during commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Metadata Log Entries metadata table [iceberg-python]

Reply via email to