MrDerecho commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2997900165
@kevinjqliu, for context: Trino (Athena) can tolerate duplicate files referenced in a table's metadata, but other upstream consumers, i.e. Snowflake external Iceberg tables and Databricks Uniform, cannot. So if a duplicate gets loaded at some point during batch loading with PyIceberg's `add_files`, those consumers (upstream of the PyIceberg-managed Iceberg tables) can no longer consume or update the tables.

Today I run enterprise-grade ETL with the `add_files` method, since I have large, homogeneously partitioned Parquet files created by a separate provider. The error messages Snowflake and Databricks return are generic application errors, but they do include the offending `s3_uri`. Basically, all I would need is a method that lets me prune those `s3_uri`s from the manifest files and surgically commit the update to the table itself.

Former remedies include copying all files out of the partition, deleting the partition, and reloading the partition (which, depending on size, can take some time). The existing duplicate check, even using the better method, would likely still take far too long between batches; I use the same check to reconcile my S3 objects with my tables, and it is time-intensive. Right now I am just using the existing `add_files` method and was hoping there was an easy way to traverse the metadata and remove the individual offending file using existing methods.
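For reference, a minimal sketch of the kind of pre-commit filtering I mean, assuming a recent PyIceberg with the `table.inspect` API; the catalog name, table name, and candidate URIs below are placeholders, not my actual setup:

```python
# Sketch: skip data files already referenced in the table's metadata before
# calling add_files. Catalog/table names and URIs are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog")              # placeholder catalog name
table = catalog.load_table("my_db.my_table")      # placeholder table name

# Candidate Parquet files produced by the upstream provider for this batch.
candidate_uris = [
    "s3://bucket/prefix/part-0001.parquet",       # placeholder URIs
    "s3://bucket/prefix/part-0002.parquet",
]

# inspect.files() returns a pyarrow table whose file_path column lists every
# data file the current snapshot references; a set makes the membership test cheap.
existing_paths = set(table.inspect.files()["file_path"].to_pylist())

new_uris = [uri for uri in candidate_uris if uri not in existing_paths]
if new_uris:
    table.add_files(file_paths=new_uris)
```

The problem is that a scan like this over the full metadata is exactly the time-intensive step between batches, and it does nothing for a duplicate that has already been committed; removing that still means rewriting the affected partition as described above. Hence the ask for a surgical remove-by-path operation on the manifests.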