Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-25 Thread via GitHub
ForeverAngry commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-3006410326 So @MrDerecho, are you saying that you are looking for a function that can remove a `DataFile` entry and then create a new snapshot with an updated `ManifestFile`?

Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-23 Thread via GitHub
MrDerecho commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2997900165 @kevinjqliu, for context, I am referring to the fact that Trino (Athena) tables can deal with duplicate files referenced in the metadata; other upstream consumers, i.e. Snowflake external

Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-23 Thread via GitHub
jayceslesar commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2994344644 I also think that the more files you can pass in a single call via the `file_paths` argument, the more performant it will be, as we have to re-compute the known data files for t
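
A minimal sketch of the batching idea in the comment above: collect candidate paths, drop the ones already known, and hand the rest to `add_files` in one call rather than many. The helper name `split_new_and_duplicate` and the `known_paths` set are hypothetical and not part of PyIceberg; how the known paths are obtained is left out here, and this is only one possible way to approach it.

```python
from typing import Iterable, List, Set, Tuple


def split_new_and_duplicate(
    candidate_paths: Iterable[str],
    known_paths: Set[str],
) -> Tuple[List[str], List[str]]:
    """Partition candidates into paths to add and paths to skip.

    Hypothetical helper: a path counts as a duplicate if it is already
    in `known_paths` or appeared earlier in `candidate_paths`.
    """
    seen = set(known_paths)  # copy so the caller's set is not mutated
    new_paths: List[str] = []
    duplicates: List[str] = []
    for path in candidate_paths:
        if path in seen:
            duplicates.append(path)
        else:
            seen.add(path)
            new_paths.append(path)
    return new_paths, duplicates


if __name__ == "__main__":
    new_paths, duplicates = split_new_and_duplicate(
        ["s3://bucket/a.parquet", "s3://bucket/b.parquet", "s3://bucket/a.parquet"],
        known_paths={"s3://bucket/b.parquet"},
    )
    print("to add:", new_paths)
    print("skipped:", duplicates)
    # Assumed usage with an already-loaded PyIceberg table, one call for the whole batch:
    # table.add_files(file_paths=new_paths)
```

Filtering up front keeps the batch in a single `add_files` call, which is the pattern the comment suggests is cheaper than repeated small calls.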

Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-22 Thread via GitHub
kevinjqliu commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2994346349 @MrDerecho @ForeverAngry can you help me understand the use case and expected behavior? > on occasion, there will be a duplicate file, I load so many files that I

Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-22 Thread via GitHub
kevinjqliu commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2994346640 I think a pseudocode snippet of how you're using `add_files` would be really helpful here!

Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-21 Thread via GitHub
jayceslesar commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2993695822 Looks like the performance hit comes from https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L850

Re: [I] Duplicate File Remediation [iceberg-python]

2025-06-20 Thread via GitHub
ForeverAngry commented on issue #2130: URL: https://github.com/apache/iceberg-python/issues/2130#issuecomment-2993011282 Thanks for raising this, @MrDerecho. This is something my team members have to deal with frequently, due to how we approach the use of `add_files`. Nice to know that it

[I] Duplicate File Remediation [iceberg-python]

2025-06-20 Thread via GitHub
MrDerecho opened a new issue, #2130: URL: https://github.com/apache/iceberg-python/issues/2130 ### Feature Request / Improvement I use pyiceberg `add_files` to perform enterprise-grade ETL loading and backfilling of Iceberg tables. On occasion, there will be a duplicate file, I load so