rdblue commented on code in PR #41: URL: https://github.com/apache/iceberg-python/pull/41#discussion_r1429240582
########## pyiceberg/table/__init__.py: ########## @@ -830,6 +884,49 @@ def history(self) -> List[SnapshotLogEntry]: def update_schema(self, allow_incompatible_changes: bool = False, case_sensitive: bool = True) -> UpdateSchema: return UpdateSchema(self, allow_incompatible_changes=allow_incompatible_changes, case_sensitive=case_sensitive) + def write_arrow(self, df: pa.Table, mode: Literal['append', 'overwrite'] = 'overwrite') -> None: Review Comment: I'm wondering what API we want to expose here. Do we want a `write_arrow` method or should we just accept a `df` and test whether it is an acceptable type? Spark used a similar API (`write` with a `mode` argument) that we ended up replacing with one that is more specific to avoid confusion. In the new API, we used more specific verbs like `append`, `overwrite_by_partition`, `overwrite_by_filter`, `replace`, etc. One of the problems was that it wasn't clear what the behavior of `overwrite` would be: it could replace the schema and the data or it could replace only the data. It's a little more clear here because this is passing data to the table instance; I think it would be reasonable to assume that the table schema is enforced. However, I don't think a `mode` is quite enough to cover what to overwrite: whether it is the entire table or just parts of it. What about a slightly different API that uses method names for the verb? Something like this: ```python def append(self, df: pa.Table) -> None: """Appends the contents of a dataframe to the table""" pass def overwrite(self, df: pa.Table, overwrite_filter: Expression=AlwaysTrue()) -> None: """Overwrites the data matching overwrite_filter with the contents of a dataframe""" if overwrite_filter != AlwaysTrue: raise NotImplementedError("Cannot overwrite a subset of a table") if len(self.spec().fields) > 0: raise NotImplementedError("Cannot write to a partitioned table") ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org