HonahX commented on code in PR #139: URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1412909663
########## pyiceberg/table/__init__.py: ########## @@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate): removals: List[str] +class TableMetadataUpdateContext: + updates: List[TableUpdate] + last_added_schema_id: Optional[int] Review Comment: Thanks for sharing the example! I'd like to clarify some points to ensure I fully understand your perspective. According to [Java ref](https://github.com/apache/iceberg/blob/2aac63688054ac36540072cfa13aac08fccf9421/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1449-L1489), It appears that `lastAddedSchemaId` refers to the most recently added schema in the **current** transaction. The purpose of `lastAddedSchemaId `in Java is to support multiple `AddSchema` updates within a single transaction, allowing for efficient tracking of the latest schema addition. This is particularly beneficial if a schema added later in the transaction is the same as one added earlier. However, in Python, where we limit to one `AddSchema` per transaction, `lastAddedSchemaId` should either be the ID of the added schema or None, specifically when the added schema is identical to existing ones in the metadata. The example you provided brings up an important consideration for our Python implementation: if the result schema of `UpdateSchema` is identical to a previously added schema in base_metadata, we should not add this new schema to the metadata. Instead, we only need to set the current schema ID to that of the previously existing one. In Java, metadata builder will suppress the `AddSchema` update if the new schema is a duplicate of any previous schemas. Inspired by this, I think we can implement a similar mechanism in `UpdateSchema.commit()`. If the new schema is identical to an existing one, only a single `setCurrentSchema` should be added. In conclusion, to enhance our approach to schema updates, I propose the following: 1. In `UpdateSchema.commit()`: We should check if new_schema is identical to any existing schemas before adding an `AddSchema` and a `SetCurrentSchema`. If they are identical, only a `SetCurrentSchema(existing_schema_id)` is necessary. 2. Regarding `lastAddedSchemaId`: It’s crucial to ensure that if it's not None, it points to a schema added earlier in the current transaction. Since Python transactions contain only one AddSchema, this should be straightforward. We could also consider adding a uniqueness check in update_table_metadata to confirm that the updates are from a single transaction. Also, instead of having a `lastAddedSchemaId`, we may add a method called `last_added_schema()` to extract the last added schema from the update context. What do you think – do the above reasoning and proposed changes seem good to you? Thanks! ########## pyiceberg/table/__init__.py: ########## @@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate): removals: List[str] +class TableMetadataUpdateContext: + updates: List[TableUpdate] + last_added_schema_id: Optional[int] Review Comment: Thanks for sharing the example! I'd like to clarify some points to ensure I fully understand your perspective. According to [Java ref](https://github.com/apache/iceberg/blob/2aac63688054ac36540072cfa13aac08fccf9421/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1449-L1489), It appears that `lastAddedSchemaId` refers to the most recently added schema in the **current** transaction. The purpose of `lastAddedSchemaId `in Java is to support multiple `AddSchema` updates within a single transaction, allowing for efficient tracking of the latest schema addition. This is particularly beneficial if a schema added later in the transaction is the same as one added earlier. However, in Python, where we limit to one `AddSchema` per transaction, `lastAddedSchemaId` should either be the ID of the added schema or None, specifically when the added schema is identical to existing ones in the metadata. The example you provided brings up an important consideration for our Python implementation: if the result schema of `UpdateSchema` is identical to a previously added schema in base_metadata, we should not add this new schema to the metadata. Instead, we only need to set the current schema ID to that of the previously existing one. In Java, metadata builder will suppress the `AddSchema` update if the new schema is a duplicate of any previous schemas. Inspired by this, I think we can implement a similar mechanism in `UpdateSchema.commit()`. If the new schema is identical to an existing one, only a single `setCurrentSchema` should be added. In conclusion, to enhance our approach to schema updates, I propose the following: 1. In `UpdateSchema.commit()`: We should check if new_schema is identical to any existing schemas before adding an `AddSchema` and a `SetCurrentSchema`. If they are identical, only a `SetCurrentSchema(existing_schema_id)` is necessary. 2. Regarding `lastAddedSchemaId`: It’s crucial to ensure that if it's not None, it points to a schema added earlier in the current transaction. Since Python transactions contain only one AddSchema, this should be straightforward. We could also consider adding a uniqueness check in update_table_metadata to confirm that the updates are from a single transaction. Also, instead of having a `lastAddedSchemaId`, we may add a method called `last_added_schema()` to extract the last added schema from the update context. What do you think – do the above reasoning and proposed changes seem good to you? Thanks! ########## pyiceberg/table/__init__.py: ########## @@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate): removals: List[str] +class TableMetadataUpdateContext: + updates: List[TableUpdate] + last_added_schema_id: Optional[int] Review Comment: Thanks for sharing the example! I'd like to clarify some points to ensure I fully understand your perspective. According to [Java ref](https://github.com/apache/iceberg/blob/2aac63688054ac36540072cfa13aac08fccf9421/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1449-L1489), It appears that `lastAddedSchemaId` refers to the most recently added schema in the **current** transaction. The purpose of `lastAddedSchemaId `in Java is to support multiple `AddSchema` updates within a single transaction, allowing for efficient tracking of the latest schema addition. This is particularly beneficial if a schema added later in the transaction is the same as one added earlier. However, in Python, where we limit to one `AddSchema` per transaction, `lastAddedSchemaId` should either be the ID of the added schema or None, specifically when the added schema is identical to existing ones in the metadata. The example you provided brings up an important consideration for our Python implementation: if the result schema of `UpdateSchema` is identical to a previously added schema in base_metadata, we should not add this new schema to the metadata. Instead, we only need to set the current schema ID to that of the previously existing one. In Java, metadata builder will suppress the `AddSchema` update if the new schema is a duplicate of any previous schemas. Inspired by this, I think we can implement a similar mechanism in `UpdateSchema.commit()`. If the new schema is identical to an existing one, only a single `setCurrentSchema` should be added. In conclusion, to enhance our approach to schema updates, I propose the following: 1. In `UpdateSchema.commit()`: We should check if new_schema is identical to any existing schemas before adding an `AddSchema` and a `SetCurrentSchema`. If they are identical, only a `SetCurrentSchema(existing_schema_id)` is necessary. 2. Regarding `lastAddedSchemaId`: It’s crucial to ensure that if it's not None, it points to a schema added earlier in the current transaction. Since Python transactions contain only one AddSchema, this should be straightforward. We could also consider adding a uniqueness check in update_table_metadata to confirm that the updates are from a single transaction. Also, instead of having a `lastAddedSchemaId`, we may add a method called `last_added_schema()` to extract the last added schema from the update context. What do you think – do the above reasoning and proposed changes seem good to you? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org