sungwy commented on code in PR #1938: URL: https://github.com/apache/iceberg-python/pull/1938#discussion_r2070994443
########## pyiceberg/table/update/validate.py: ########## @@ -69,3 +75,74 @@ def validation_history( raise ValidationException("No matching snapshot found.") return manifests_files, snapshots + + +def deleted_data_files( + table: Table, + starting_snapshot: Snapshot, + data_filter: Optional[BooleanExpression], + parent_snapshot: Optional[Snapshot], + partition_set: Optional[set[Record]], +) -> Iterator[ManifestEntry]: + """Find deleted data files matching a filter since a starting snapshot. + + Args: + table: Table to validate + starting_snapshot: Snapshot current at the start of the operation + data_filter: Expression used to find deleted data files + partition_set: a set of partitions to find deleted data files + parent_snapshot: Ending snapshot on the branch being validated + + Returns: + List of deleted data files matching the filter + """ + # if there is no current table state, no files have been deleted + if parent_snapshot is None: + return + + manifests, snapshot_ids = validation_history( + table, + starting_snapshot, + parent_snapshot, + VALIDATE_DATA_FILES_EXIST_OPERATIONS, + ManifestContent.DATA, + ) + + if data_filter is not None: + evaluator = _StrictMetricsEvaluator(table.schema(), data_filter).eval Review Comment: I'm not too sure if this is correct, because `ManifestGroup.entries` seems to be using inclusive projection. Should we be using `inclusive_projection` here instead? Summoning @Fokko for a second review ########## pyiceberg/table/update/validate.py: ########## @@ -69,3 +75,74 @@ def validation_history( raise ValidationException("No matching snapshot found.") return manifests_files, snapshots + + +def deleted_data_files( + table: Table, + starting_snapshot: Snapshot, + data_filter: Optional[BooleanExpression], + parent_snapshot: Optional[Snapshot], + partition_set: Optional[set[Record]], Review Comment: This looks out of order: ```suggestion partition_set: Optional[set[Record]], parent_snapshot: Optional[Snapshot], ``` It looks like the function call in `validate_deleted_data_files` is assuming that `parent_snapshot` is the last argument ``` conflicting_entries = deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot) ``` ########## pyiceberg/table/update/validate.py: ########## @@ -69,3 +75,74 @@ def validation_history( raise ValidationException("No matching snapshot found.") return manifests_files, snapshots + + +def deleted_data_files( + table: Table, + starting_snapshot: Snapshot, + data_filter: Optional[BooleanExpression], + parent_snapshot: Optional[Snapshot], + partition_set: Optional[set[Record]], +) -> Iterator[ManifestEntry]: + """Find deleted data files matching a filter since a starting snapshot. + + Args: + table: Table to validate + starting_snapshot: Snapshot current at the start of the operation + data_filter: Expression used to find deleted data files + partition_set: a set of partitions to find deleted data files + parent_snapshot: Ending snapshot on the branch being validated + + Returns: + List of deleted data files matching the filter + """ + # if there is no current table state, no files have been deleted + if parent_snapshot is None: + return + + manifests, snapshot_ids = validation_history( + table, + starting_snapshot, + parent_snapshot, Review Comment: This looks out of order to, because `validation_history` has `to_snapshot` as the second argument, and `from_snapshot` as the third argument. ```suggestion starting_snapshot, parent_snapshot, ``` Do you think it would be better to update `validation_history` function to use the following function signature instead? I think it's a lot more expected to have `from_snapshot` then `to_snapshot` ``` def validation_history( table: Table, from_snapshot: Snapshot, to_snapshot: Snapshot, matching_operations: set[Operation], manifest_content_filter: ManifestContent, ) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org