Re: [PR] Update table metadata [iceberg-python]

via GitHub Sat, 02 Dec 2023 18:05:37 -0800


HonahX commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1412909663



##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]

Review Comment:
   Thanks for sharing the example! I'd like to clarify some points to ensure I 
fully understand your perspective. According to [Java 
ref](https://github.com/apache/iceberg/blob/2aac63688054ac36540072cfa13aac08fccf9421/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1449-L1489),
 It appears that `lastAddedSchemaId` refers to the most recently added schema 
in the **current** transaction.
   
   The purpose of `lastAddedSchemaId `in Java is to support multiple 
`AddSchema` updates within a single transaction, allowing for efficient 
tracking of the latest schema addition. This is particularly beneficial if a 
schema added later in the transaction is the same as one added earlier. 
However, in Python, where we limit to one `AddSchema` per transaction, 
`lastAddedSchemaId` should either be the ID of the added schema or None, 
specifically when the added schema is identical to existing ones in the 
metadata.
   
   The example you provided brings up an important consideration for our Python 
implementation: if the result schema of `UpdateSchema` is identical to a 
previously added schema in base_metadata, we should not add this new schema to 
the metadata. Instead, we only need to set the current schema ID to that of the 
previously existing one.
   In Java, metadata builder will suppress the `AddSchema` update if the new 
schema is a duplicate of any previous schemas. Inspired by this, I think we can 
implement a similar mechanism in `UpdateSchema.commit()`. If the new schema is 
identical to an existing one, only a single `setCurrentSchema` should be added.
   
   In conclusion, to enhance our approach to schema updates, I propose the 
following:
   1. In `UpdateSchema.commit()`: We should check if new_schema is identical to 
any existing schemas before adding an `AddSchema` and a `SetCurrentSchema`. If 
they are identical, only a `SetCurrentSchema(existing_schema_id)` is necessary.
   2. Regarding `lastAddedSchemaId`: It’s crucial to ensure that if it's not 
None, it points to a schema added earlier in the current transaction. Since 
Python transactions contain only one AddSchema, this should be straightforward. 
We could also consider adding a uniqueness check in update_table_metadata to 
confirm that the updates are from a single transaction. Also, instead of having 
a `lastAddedSchemaId`, we may add a method called `last_added_schema()` to 
extract the last added schema from the update context.
   
   What do you think – do the above reasoning and proposed changes seem good to 
you?
   
   Thanks!



##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]

Review Comment:
   Thanks for sharing the example! I'd like to clarify some points to ensure I 
fully understand your perspective. According to [Java 
ref](https://github.com/apache/iceberg/blob/2aac63688054ac36540072cfa13aac08fccf9421/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1449-L1489),
 It appears that `lastAddedSchemaId` refers to the most recently added schema 
in the **current** transaction.
   
   The purpose of `lastAddedSchemaId `in Java is to support multiple 
`AddSchema` updates within a single transaction, allowing for efficient 
tracking of the latest schema addition. This is particularly beneficial if a 
schema added later in the transaction is the same as one added earlier. 
However, in Python, where we limit to one `AddSchema` per transaction, 
`lastAddedSchemaId` should either be the ID of the added schema or None, 
specifically when the added schema is identical to existing ones in the 
metadata.
   
   The example you provided brings up an important consideration for our Python 
implementation: if the result schema of `UpdateSchema` is identical to a 
previously added schema in base_metadata, we should not add this new schema to 
the metadata. Instead, we only need to set the current schema ID to that of the 
previously existing one.
   In Java, metadata builder will suppress the `AddSchema` update if the new 
schema is a duplicate of any previous schemas. Inspired by this, I think we can 
implement a similar mechanism in `UpdateSchema.commit()`. If the new schema is 
identical to an existing one, only a single `setCurrentSchema` should be added.
   
   In conclusion, to enhance our approach to schema updates, I propose the 
following:
   1. In `UpdateSchema.commit()`: We should check if new_schema is identical to 
any existing schemas before adding an `AddSchema` and a `SetCurrentSchema`. If 
they are identical, only a `SetCurrentSchema(existing_schema_id)` is necessary.
   2. Regarding `lastAddedSchemaId`: It’s crucial to ensure that if it's not 
None, it points to a schema added earlier in the current transaction. Since 
Python transactions contain only one AddSchema, this should be straightforward. 
We could also consider adding a uniqueness check in update_table_metadata to 
confirm that the updates are from a single transaction. Also, instead of having 
a `lastAddedSchemaId`, we may add a method called `last_added_schema()` to 
extract the last added schema from the update context.
   
   What do you think – do the above reasoning and proposed changes seem good to 
you?
   
   Thanks!



##########
pyiceberg/table/__init__.py:
##########
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]

Review Comment:
   Thanks for sharing the example! I'd like to clarify some points to ensure I 
fully understand your perspective. According to [Java 
ref](https://github.com/apache/iceberg/blob/2aac63688054ac36540072cfa13aac08fccf9421/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1449-L1489),
 It appears that `lastAddedSchemaId` refers to the most recently added schema 
in the **current** transaction.
   
   The purpose of `lastAddedSchemaId `in Java is to support multiple 
`AddSchema` updates within a single transaction, allowing for efficient 
tracking of the latest schema addition. This is particularly beneficial if a 
schema added later in the transaction is the same as one added earlier. 
However, in Python, where we limit to one `AddSchema` per transaction, 
`lastAddedSchemaId` should either be the ID of the added schema or None, 
specifically when the added schema is identical to existing ones in the 
metadata.
   
   The example you provided brings up an important consideration for our Python 
implementation: if the result schema of `UpdateSchema` is identical to a 
previously added schema in base_metadata, we should not add this new schema to 
the metadata. Instead, we only need to set the current schema ID to that of the 
previously existing one.
   In Java, metadata builder will suppress the `AddSchema` update if the new 
schema is a duplicate of any previous schemas. Inspired by this, I think we can 
implement a similar mechanism in `UpdateSchema.commit()`. If the new schema is 
identical to an existing one, only a single `setCurrentSchema` should be added.
   
   In conclusion, to enhance our approach to schema updates, I propose the 
following:
   1. In `UpdateSchema.commit()`: We should check if new_schema is identical to 
any existing schemas before adding an `AddSchema` and a `SetCurrentSchema`. If 
they are identical, only a `SetCurrentSchema(existing_schema_id)` is necessary.
   2. Regarding `lastAddedSchemaId`: It’s crucial to ensure that if it's not 
None, it points to a schema added earlier in the current transaction. Since 
Python transactions contain only one AddSchema, this should be straightforward. 
We could also consider adding a uniqueness check in update_table_metadata to 
confirm that the updates are from a single transaction. Also, instead of having 
a `lastAddedSchemaId`, we may add a method called `last_added_schema()` to 
extract the last added schema from the update context.
   
   What do you think – do the above reasoning and proposed changes seem good to 
you?
   
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Update table metadata [iceberg-python]

Reply via email to