soumya-ghosh commented on issue #1053: URL: https://github.com/apache/iceberg-python/issues/1053#issuecomment-2350097049
From the [Spark docs](https://iceberg.apache.org/docs/latest/spark-queries/#all-metadata-tables):

> These tables are unions of the metadata tables specific to the current snapshot, and return metadata across all snapshots.
>
> The "all" metadata tables may produce more than one row per data file or manifest file because metadata files may be part of more than one table snapshot.

So, here's my approach (a rough sketch using PyIceberg; the catalog and table names are placeholders, and `process` stands in for whatever per-file handling is needed):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # assumes a catalog named "default" is configured
table = catalog.load_table("db.table")     # placeholder table identifier
io = table.io

# Walk every snapshot, read its manifest list, then read every manifest it references.
for snapshot in table.metadata.snapshots:
    for manifest_file in snapshot.manifests(io):
        # Note: entry statuses (ADDED / EXISTING / DELETED) may need filtering here.
        for entry in manifest_file.fetch_manifest_entry(io):
            process(entry.data_file)       # a data file or a delete file, depending on the manifest content
```

With this approach the number of files in the output is much higher than the corresponding output of the `all_files` table in Spark.
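
One possible source of the inflation, as far as I can tell: the same manifest file is usually reachable from many snapshots, so iterating snapshot-by-snapshot reads shared manifests (and therefore their files) once per snapshot. A variant that deduplicates manifests by path before reading them might get closer to what `all_files` reports; this is only a sketch and reuses the hypothetical `table`, `io`, and `process` from above:

```python
# Read each distinct manifest only once, no matter how many snapshots reference it.
seen_manifest_paths = set()

for snapshot in table.metadata.snapshots:
    for manifest_file in snapshot.manifests(io):
        if manifest_file.manifest_path in seen_manifest_paths:
            continue  # already read via an earlier snapshot
        seen_manifest_paths.add(manifest_file.manifest_path)
        for entry in manifest_file.fetch_manifest_entry(io):
            process(entry.data_file)
```

I haven't verified that this is exactly what Spark's `all_*` tables do internally, so treat it as a guess.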