soumya-ghosh commented on issue #1053:
URL: https://github.com/apache/iceberg-python/issues/1053#issuecomment-2350097049

   From the [Spark docs](https://iceberg.apache.org/docs/latest/spark-queries/#all-metadata-tables):
   
   > These tables are unions of the metadata tables specific to the current snapshot, and return metadata across all snapshots.
   > The "all" metadata tables may produce more than one row per data file or manifest file because metadata files may be part of more than one table snapshot.
   
   So, here's my approach (pseudo-code):
   ```python
   # Helper names below are placeholders, not real PyIceberg calls.
   metadata = load_table_metadata()
   for snapshot in metadata["snapshots"]:
       manifest_list = read_manifest_list(snapshot)
       for manifest_file in manifest_list:
           manifest = read_manifest(manifest_file)
           for file in manifest:
               process(file)  # file is a data_file or delete_file
   ```
   
   With this approach, the number of files in the output is much higher than in the corresponding output of the `all_files` table in Spark.
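
To see how the nested walk can inflate the count, here is a toy sketch with made-up snapshot and manifest names (no PyIceberg calls). It assumes the extra rows come from manifests that are shared by several snapshots and therefore get read once per snapshot. Whether Spark's `all_files` deduplicates manifests this way is an assumption here, not something confirmed in this thread.

```python
# Toy model: each snapshot lists manifest paths, each manifest lists file paths.
# All names and paths are hypothetical and exist only for illustration.
snapshots = {
    "s1": ["m1.avro"],
    "s2": ["m1.avro", "m2.avro"],
    "s3": ["m1.avro", "m2.avro", "m3.avro"],
}
manifests = {
    "m1.avro": ["a.parquet", "b.parquet"],
    "m2.avro": ["c.parquet"],
    "m3.avro": ["d.parquet"],
}


def walk_per_snapshot(snapshots, manifests):
    """Nested walk: re-reads a manifest for every snapshot that lists it."""
    return [
        (manifest_path, file_path)
        for manifest_paths in snapshots.values()
        for manifest_path in manifest_paths
        for file_path in manifests[manifest_path]
    ]


def walk_unique_manifests(snapshots, manifests):
    """Reads each distinct manifest once, however many snapshots list it."""
    seen = set()
    rows = []
    for manifest_paths in snapshots.values():
        for manifest_path in manifest_paths:
            if manifest_path in seen:
                continue  # already read via an earlier snapshot
            seen.add(manifest_path)
            rows.extend((manifest_path, f) for f in manifests[manifest_path])
    return rows


per_snapshot_rows = walk_per_snapshot(snapshots, manifests)
unique_manifest_rows = walk_unique_manifests(snapshots, manifests)
print(len(per_snapshot_rows), len(unique_manifest_rows))
```

In this toy data, the per-snapshot walk emits 9 rows and the unique-manifest walk emits 4, although both cover the same set of files. Even the deduplicated walk could still return more than one row per data file if the same file appears in different manifests, which matches the note in the docs quoted above.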
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

