Fokko commented on code in PR #1388: URL: https://github.com/apache/iceberg-python/pull/1388#discussion_r1863537090
########## pyiceberg/table/__init__.py: ########## @@ -1493,6 +1496,13 @@ def to_ray(self) -> ray.data.dataset.Dataset: return ray.data.from_arrow(self.to_arrow()) + def count(self) -> int: + res = 0 + tasks = self.plan_files() Review Comment: Yes, this can be widely off, not just because the merge-on-read deletes, but because `plan_files` returns all the files that (might) contain relevant rows. For example, if it cannot be determined if has relevant data, it will be returned by `plan_files`. I think there are two ways forward: - One is similar to how we handle deletes. For deletes, we check if the whole file matches, if this is the case, then we can simply drop the file from the metadata. You can find the code [here](https://github.com/apache/iceberg-python/blob/5bef1bfe677df7016dc37b8db87650c34faceb7a/pyiceberg/table/update/snapshot.py#L359-L427). If a file fully matches, is is valid to use `task.file.record_count`. We would need to extend this to also see if there are also merge-on-read deletes as Kevin already mentioned, or just fail when there are positional deletes. - A cleaner option, but this is a bit more work, but pretty exciting, would be to include the `residual-predicate` in the `FileScanTask`. When we run a query, like `day_added = 2024-12-01 and user_id = 10`, then the `day_added = 2024-12-01` might be satisfied with the partitioning already. This is the case when the table is partitioned by day, and we know that all the data in the file evaluates `true` for `day_added = 2024-12-01`, then we need to open the file, and filter for `user_id = 10`. If we would leave out the `user_id = 10`, then it would be `ALWAYS_TRUE`, and then we know that we can just use `task.file.record_count`. This way we could very easily loop over the `.plan_files()`: ```python def count(self) -> int: res = 0 tasks = self.plan_files() for task in tasks: if task.residual == ALWAYS_TRUE and len(task.delete_files): res += task.file.record_count else: # pseudocode, open the table, and apply the filter and deletes res += len(_table_from_scan_task(task)) return res ``` To get to the second step, we first have to port the `ResidualEvaluator`. The java code can be found [here](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java), including some [excellent tests](https://github.com/apache/iceberg/blob/main/api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org