kevinjqliu commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2994346349
@MrDerecho @ForeverAngry can you help me understand the use case and expected behavior?

> on occasion, there will be a duplicate file, I load so many files that I can't use the duplicate check due to performance constraints (500-2000 files at a time).

The duplicate check is an anti-join between the files to be added and the table's existing data files. The existing data files come from reading all the manifest list and manifest files, so I think the real bottleneck here might be reading the manifests. Is that what you're seeing? (See the sketch at the end of this comment.)

> Note: AWS Athena doesn't have an issue with duplicate loads, however upstream Snowflake external tables and Databricks Delta uniform tables do.

Does Athena have the `add_files` feature? I'm also curious what the issue with duplicate loads is in Snowflake external tables and Databricks Delta uniform tables.

> Often time the error messages provided have the S3 URI regarding the impacted file- I just need a process to surgically de-dupe the file in question from the metadata in-situ

What is the error exactly? 503?
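For context, here is a minimal sketch of working around the per-call duplicate check by doing the anti-join once up front. It assumes a recent PyIceberg where `Table.inspect.files()` and the `check_duplicate_files` flag on `add_files` are available; the catalog name, table name, and file paths are hypothetical:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names.
catalog = load_catalog("default")
tbl = catalog.load_table("db.events")

# Collect the file paths already referenced by the table's metadata.
# inspect.files() reads the manifest list and manifest files, which is
# likely where the cost of the built-in duplicate check comes from.
existing = set(tbl.inspect.files()["file_path"].to_pylist())

# Hypothetical batch of incoming files (500-2000 at a time in the report).
incoming = [
    "s3://bucket/data/f1.parquet",
    "s3://bucket/data/f2.parquet",
]
new_files = [p for p in incoming if p not in existing]

if new_files:
    # Skip re-reading the manifests on this call, since the
    # de-duplication was already done above.
    tbl.add_files(new_files, check_duplicate_files=False)
```

The trade-off in this sketch is reading the manifests once per batch rather than relying on the built-in check; whether that helps depends on where the slowdown actually is, which is what the questions above are trying to pin down.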