Fokko commented on code in PR #506: URL: https://github.com/apache/iceberg-python/pull/506#discussion_r1524358528
##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: BooleanExpression = ALWAYS_T
                 for data_file in data_files:
                     update_snapshot.append_data_file(data_file)
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with Transform Partitions")

Review Comment:
We can be more permissive. It isn't a problem if the table's current partitioning uses something other than an `IdentityTransform`; the issue is that we cannot add DataFiles that use this partitioning (until we find a clever way of checking this).

##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: BooleanExpression = ALWAYS_T
                 for data_file in data_files:
                     update_snapshot.append_data_file(data_file)
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files to the table
+        """

Review Comment:
It would be great to add a `Raises:` section here indicating which errors to expect, for example when a file cannot be found. In such a case, we want to raise a PyIceberg exception instead of an Arrow-specific exception.

##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: BooleanExpression = ALWAYS_T
                 for data_file in data_files:
                     update_snapshot.append_data_file(data_file)
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with Transform Partitions")
+
+        if self.name_mapping() is None:

Review Comment:
Technically, you don't have to add a name-mapping if the field-IDs are set.

##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: BooleanExpression = ALWAYS_T
                 for data_file in data_files:
                     update_snapshot.append_data_file(data_file)
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with Transform Partitions")
+
+        if self.name_mapping() is None:
+            with self.transaction() as tx:
+                tx.set_properties(**{TableProperties.DEFAULT_NAME_MAPPING: self.schema().name_mapping.model_dump_json()})
+
+        with self.transaction() as txn:
+            with txn.update_snapshot().fast_append() as update_snapshot:

Review Comment:
Now that https://github.com/apache/iceberg-python/pull/471 is merged, this should work in a single transaction. The updated metadata will be passed into the UpdateSnapshot class and should pick up the name-mapping.
```suggestion
            with tx.update_snapshot().fast_append() as update_snapshot:
```
I think it is important to have this operation in a single transaction. Otherwise, the name mapping might be set, and then, if a file is missing, the operation will fail and the name-mapping will still be there.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
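To illustrate the single-transaction point from the last review comment, here is a minimal, self-contained sketch. It does not use PyIceberg: `ToyTable`, `ToyTransaction`, the `toy.name-mapping` property key, and this standalone `add_files` helper are all hypothetical stand-ins, and the real `Transaction` class may behave differently. The only assumption is that staged changes are applied when the `with` block exits cleanly and dropped otherwise.

```python
import os
import tempfile
from typing import Dict, List


class ToyTable:
    """Hypothetical stand-in for a table: committed properties plus data files."""

    def __init__(self) -> None:
        self.properties: Dict[str, str] = {}
        self.data_files: List[str] = []

    def transaction(self) -> "ToyTransaction":
        return ToyTransaction(self)


class ToyTransaction:
    """Stages changes and applies them only if the `with` block exits cleanly."""

    def __init__(self, table: ToyTable) -> None:
        self._table = table
        self._staged_properties: Dict[str, str] = {}
        self._staged_files: List[str] = []

    def set_properties(self, **properties: str) -> None:
        self._staged_properties.update(properties)

    def append_data_file(self, path: str) -> None:
        # Fail early on a missing file, before anything is committed.
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Cannot add missing file: {path}")
        self._staged_files.append(path)

    def __enter__(self) -> "ToyTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            # Commit: property change and new files land together.
            self._table.properties.update(self._staged_properties)
            self._table.data_files.extend(self._staged_files)
        return False  # never swallow the exception


def add_files(table: ToyTable, file_paths: List[str]) -> None:
    """Set the (toy) name mapping and append files in ONE transaction."""
    with table.transaction() as tx:
        if "toy.name-mapping" not in table.properties:
            tx.set_properties(**{"toy.name-mapping": "[]"})
        for path in file_paths:
            tx.append_data_file(path)


def demo() -> ToyTable:
    """Run one failing and one succeeding call against the same table."""
    table = ToyTable()
    try:
        add_files(table, ["/nonexistent/definitely-missing.parquet"])
    except FileNotFoundError:
        pass  # nothing was committed, including the name mapping
    with tempfile.NamedTemporaryFile(suffix=".parquet") as handle:
        add_files(table, [handle.name])
    return table
```

With two separate transactions, the failing call would have left the property behind. Here the property is committed only together with the files.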
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org