Fokko commented on code in PR #139: URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1412913056
########## pyiceberg/table/__init__.py: ########## @@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate): removals: List[str] +class TableMetadataUpdateContext: + updates: List[TableUpdate] + last_added_schema_id: Optional[int] Review Comment: Thanks for the thorough explanation. > In UpdateSchema.commit(): We should check if new_schema is identical to any existing schemas before adding an AddSchema and a SetCurrentSchema. If they are identical, only a SetCurrentSchema(existing_schema_id) is necessary. Yes, this makes a lot of sense to me. > Regarding lastAddedSchemaId: It’s crucial to ensure that if it's not None, it points to a schema added earlier in the current transaction. Since Python transactions contain only one AddSchema, this should be straightforward. We could also consider adding a uniqueness check in update_table_metadata to confirm that the updates are from a single transaction. Also, instead of having a lastAddedSchemaId, we may add a method called last_added_schema() to extract the last added schema from the update context. This also makes sense to me. However one thing to note. The `lastAddedSchemaId` sounds to me like an implementation detail from Java. In PyIceberg the current situation is simpler as you explained, so we could just do `max(schema.id from tbl.schemas) + 1`. You sparked my interest here, let me double check this on the REST side as well 👍 ########## pyiceberg/table/__init__.py: ########## @@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate): removals: List[str] +class TableMetadataUpdateContext: + updates: List[TableUpdate] + last_added_schema_id: Optional[int] Review Comment: Thanks for the thorough explanation. > In UpdateSchema.commit(): We should check if new_schema is identical to any existing schemas before adding an AddSchema and a SetCurrentSchema. If they are identical, only a SetCurrentSchema(existing_schema_id) is necessary. Yes, this makes a lot of sense to me. > Regarding lastAddedSchemaId: It’s crucial to ensure that if it's not None, it points to a schema added earlier in the current transaction. Since Python transactions contain only one AddSchema, this should be straightforward. We could also consider adding a uniqueness check in update_table_metadata to confirm that the updates are from a single transaction. Also, instead of having a lastAddedSchemaId, we may add a method called last_added_schema() to extract the last added schema from the update context. This also makes sense to me. However one thing to note. The `lastAddedSchemaId` sounds to me like an implementation detail from Java. In PyIceberg the current situation is simpler as you explained, so we could just do `max(schema.id from tbl.schemas) + 1`. You sparked my interest here, let me double check this on the REST side as well 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org