rdblue commented on code in PR #6323:
URL: https://github.com/apache/iceberg/pull/6323#discussion_r1217020366


##########
python/pyiceberg/table/__init__.py:
##########
@@ -69,21 +72,313 @@
     import ray
     from duckdb import DuckDBPyConnection
 
+    from pyiceberg.catalog import Catalog
 
 ALWAYS_TRUE = AlwaysTrue()
 
 
+class TableUpdates:
+    _table: Table
+    _updates: Tuple[TableUpdate, ...]
+    _requirements: Tuple[TableRequirement, ...]
+
+    def __init__(
+        self,
+        table: Table,
+        actions: Optional[Tuple[TableUpdate, ...]] = None,
+        requirements: Optional[Tuple[TableRequirement, ...]] = None,
+    ):
+        self._table = table
+        self._updates = actions or ()
+        self._requirements = requirements or ()
+
+    def _append_updates(self, *new_updates: TableUpdate) -> TableUpdates:
+        """Appends updates to the set of staged updates
+
+        Args:
+            *new_updates: Any new updates
+
+        Raises:
+            ValueError: When the type of update is not unique.
+
+        Returns:
+            A new TableUpdates object with the new updates appended
+        """
+        for new_update in new_updates:
+            type_new_update = type(new_update)
+            if any(type(update) == type_new_update for update in self._updates):
+                raise ValueError(f"Updates in a single commit need to be unique, duplicate: {type_new_update}")

Review Comment:
   It looks like this class is attempting to behave like a transaction because it stacks up a set of changes and commits them all at once. That seems reasonable, but then we get strange cases like this one, where there are odd restrictions. This restriction would definitely be hit in practice, because the changes for a real transaction would commonly include more than one `AddSnapshot` update, but just one `SetRefSnapshotId` update.
   
   I think this is also going to hit an issue with complex changes, like `UpdateSchema`. That change supports multiple calls and then results in a finished schema that is sent using `AddSchema` and `SetCurrentSchemaId` updates. For the API, this class would either need to include all of the schema change methods here -- which will get ugly really fast -- or we need an `UpdateSchema` API that returns back to the overall transaction API (sketched below).
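   
   To make that second option concrete, here is a minimal Python sketch of such a builder. Everything in it is hypothetical; in particular, the `stage` callback is just a stand-in for however the transaction ends up accumulating updates, and tuples stand in for the real update classes:
   
   ```python
   from __future__ import annotations
   
   from typing import Any, Callable, List, Tuple
   
   
   class UpdateSchema:
       """Hypothetical builder: supports multiple chained calls and then
       hands the finished schema back to its owner as plain updates."""
   
       def __init__(self, stage: Callable[..., None]) -> None:
           # `stage` is whatever accumulation hook the transaction exposes.
           self._stage = stage
           self._added: List[Tuple[str, Any]] = []
   
       def add_column(self, name: str, field_type: Any) -> UpdateSchema:
           self._added.append((name, field_type))
           return self  # chainable, like the Java fluent API
   
       def commit(self) -> None:
           # Build the finished schema once, then emit it as the two updates
           # named above; tuples stand in for AddSchema / SetCurrentSchemaId.
           new_schema = {"fields": list(self._added)}
           self._stage(("add-schema", new_schema), ("set-current-schema-id", -1))
   ```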
   
   In Java, we took the second approach. There's a common `UpdateSchema` API that can be performed as a single operation on a table (`table.updateSchema().addColumn("x", Types.IntegerType.get()).commit()`) or combined with others in a transaction (`table.newTransaction()` / `transaction.updateSchema().commit()` / `transaction.commitTransaction()`).
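   
   A pyiceberg equivalent of those two modes might look something like this (the method names here are hypothetical, just mirroring the Java calls above):
   
   ```python
   # Single operation on a table:
   table.update_schema().add_column("x", IntegerType()).commit()
   
   # Or combined with other changes in a transaction:
   txn = table.new_transaction()
   txn.update_schema().add_column("x", IntegerType()).commit()  # staged, not applied yet
   txn.commit_transaction()  # everything commits at once
   ```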
   
   I suspect that we want to do the same thing here and have some kind of 
transaction that accumulates changes from other more specific APIs.
   
   It looks like the issue with this PR is that it tries to combine the transaction object -- the thing that accumulates changes and calls `catalog.commit_table` -- with the public APIs for making changes to a table. I think I would take the same approach as Java and have a `Transaction` object to represent multiple changes to a table, but I would hide it from users in most cases.
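   
   For what it's worth, a minimal sketch of that shape, reusing the `UpdateSchema` sketch above (all names are hypothetical, and the `catalog.commit_table` signature is assumed from this PR):
   
   ```python
   from __future__ import annotations
   
   from typing import Any, List
   
   
   class Transaction:
       """Hypothetical accumulator: the specific change APIs stage updates
       here, and nothing reaches the catalog until commit_transaction()."""
   
       def __init__(self, table: Any) -> None:
           self._table = table
           self._updates: List[Any] = []
           self._requirements: List[Any] = []
   
       def _stage(self, *updates: Any) -> None:
           # Plain accumulation: several updates of the same type (for
           # example, multiple AddSnapshot) are fine here, unlike the
           # uniqueness check in _append_updates above.
           self._updates.extend(updates)
   
       def update_schema(self) -> "UpdateSchema":
           # Hands out a builder whose commit() stages back into this object.
           return UpdateSchema(self._stage)
   
       def commit_transaction(self) -> None:
           # One catalog round trip for the whole set of staged changes;
           # signature assumed to match this PR's catalog.commit_table.
           self._table.catalog.commit_table(self._table, self._requirements, self._updates)
   ```
   
   A table-level `update_schema()` could then create a single-use transaction internally and commit it immediately, which keeps `Transaction` out of sight for the common single-change case.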


