sungwy commented on code in PR #1938:
URL: https://github.com/apache/iceberg-python/pull/1938#discussion_r2070994443


##########
pyiceberg/table/update/validate.py:
##########
@@ -69,3 +75,74 @@ def validation_history(
         raise ValidationException("No matching snapshot found.")
 
     return manifests_files, snapshots
+
+
+def deleted_data_files(
+    table: Table,
+    starting_snapshot: Snapshot,
+    data_filter: Optional[BooleanExpression],
+    parent_snapshot: Optional[Snapshot],
+    partition_set: Optional[set[Record]],
+) -> Iterator[ManifestEntry]:
+    """Find deleted data files matching a filter since a starting snapshot.
+
+    Args:
+        table: Table to validate
+        starting_snapshot: Snapshot current at the start of the operation
+        data_filter: Expression used to find deleted data files
+        partition_set: a set of partitions to find deleted data files
+        parent_snapshot: Ending snapshot on the branch being validated
+
+    Returns:
+        List of deleted data files matching the filter
+    """
+    # if there is no current table state, no files have been deleted
+    if parent_snapshot is None:
+        return
+
+    manifests, snapshot_ids = validation_history(
+        table,
+        starting_snapshot,
+        parent_snapshot,
+        VALIDATE_DATA_FILES_EXIST_OPERATIONS,
+        ManifestContent.DATA,
+    )
+
+    if data_filter is not None:
+        evaluator = _StrictMetricsEvaluator(table.schema(), data_filter).eval

Review Comment:
   I'm not too sure if this is correct, because `ManifestGroup.entries` seems 
to be using inclusive projection.
   
   Should we be using `inclusive_projection` here instead?
   
   Summoning @Fokko for a second review



##########
pyiceberg/table/update/validate.py:
##########
@@ -69,3 +75,74 @@ def validation_history(
         raise ValidationException("No matching snapshot found.")
 
     return manifests_files, snapshots
+
+
+def deleted_data_files(
+    table: Table,
+    starting_snapshot: Snapshot,
+    data_filter: Optional[BooleanExpression],
+    parent_snapshot: Optional[Snapshot],
+    partition_set: Optional[set[Record]],

Review Comment:
   This looks out of order:
   ```suggestion
       partition_set: Optional[set[Record]],
       parent_snapshot: Optional[Snapshot],
   ```
   
   It looks like the function call in `validate_deleted_data_files` is assuming 
that `parent_snapshot` is the last argument
   ```
       conflicting_entries = deleted_data_files(table, starting_snapshot, 
data_filter, None, parent_snapshot)
   ```



##########
pyiceberg/table/update/validate.py:
##########
@@ -69,3 +75,74 @@ def validation_history(
         raise ValidationException("No matching snapshot found.")
 
     return manifests_files, snapshots
+
+
+def deleted_data_files(
+    table: Table,
+    starting_snapshot: Snapshot,
+    data_filter: Optional[BooleanExpression],
+    parent_snapshot: Optional[Snapshot],
+    partition_set: Optional[set[Record]],
+) -> Iterator[ManifestEntry]:
+    """Find deleted data files matching a filter since a starting snapshot.
+
+    Args:
+        table: Table to validate
+        starting_snapshot: Snapshot current at the start of the operation
+        data_filter: Expression used to find deleted data files
+        partition_set: a set of partitions to find deleted data files
+        parent_snapshot: Ending snapshot on the branch being validated
+
+    Returns:
+        List of deleted data files matching the filter
+    """
+    # if there is no current table state, no files have been deleted
+    if parent_snapshot is None:
+        return
+
+    manifests, snapshot_ids = validation_history(
+        table,
+        starting_snapshot,
+        parent_snapshot,

Review Comment:
   This looks out of order to, because `validation_history` has `to_snapshot` 
as the second argument, and `from_snapshot` as the third argument.
   ```suggestion
           starting_snapshot,
           parent_snapshot,
   ```
   
   Do you think it would be better to update `validation_history` function to 
use the following function signature instead? I think it's a lot more expected 
to have `from_snapshot` then `to_snapshot`
   
   ```
   def validation_history(
       table: Table,
       from_snapshot: Snapshot,
       to_snapshot: Snapshot,
       matching_operations: set[Operation],
       manifest_content_filter: ManifestContent,
   )
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to