MrDerecho commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2997900165
@kevinjqliu, for context: Trino (Athena) can tolerate duplicate files referenced in a table's metadata, but other upstream consumers, i.e. Snowflake external Iceberg tables and Databricks Uniform, cannot. So if a duplicate gets loaded at some point during batch loading with PyIceberg's `add_files`, those consumers (upstream of the PyIceberg-managed Iceberg tables) can no longer consume or update the tables.

Today I run enterprise-grade ETL with the `add_files` method, since I have large, homogeneously partitioned Parquet files created by a separate provider. The error messages Snowflake and Databricks return are generic application errors, but they do include the offending `s3_uri`. Basically, all I would need is a method that lets me prune those `s3_uri`s from the manifest files and surgically commit the update to the table itself.

Former remedies include copying all files out of the partition, deleting the partition, and reloading the partition (which, depending on size, can take some time). The existing duplicate check, even using the better method, would likely still take far too long between batches; I use the same check to reconcile my S3 objects with my tables, and it is time-intensive. Right now I am just using the existing `add_files` method and was hoping there was an easy way to traverse the metadata and remove the individual offending file using existing methods.
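For reference, a minimal sketch of the kind of pre-commit filtering I mean, assuming a recent PyIceberg with the `table.inspect` API; the catalog name, table name, and candidate URIs below are placeholders, not my actual setup:

```python
# Sketch: skip data files already referenced in the table's metadata before
# calling add_files. Catalog/table names and URIs are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog")              # placeholder catalog name
table = catalog.load_table("my_db.my_table")      # placeholder table name

# Candidate Parquet files produced by the upstream provider for this batch.
candidate_uris = [
    "s3://bucket/prefix/part-0001.parquet",       # placeholder URIs
    "s3://bucket/prefix/part-0002.parquet",
]

# inspect.files() returns a pyarrow table whose file_path column lists every
# data file the current snapshot references; a set makes the membership test cheap.
existing_paths = set(table.inspect.files()["file_path"].to_pylist())

new_uris = [uri for uri in candidate_uris if uri not in existing_paths]
if new_uris:
    table.add_files(file_paths=new_uris)
```

The problem is that a scan like this over the full metadata is exactly the time-intensive step between batches, and it does nothing for a duplicate that has already been committed; removing that still means rewriting the affected partition as described above. Hence the ask for a surgical remove-by-path operation on the manifests.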