paul-bormans-pcgw opened a new issue, #11695: URL: https://github.com/apache/iceberg/issues/11695
### Apache Iceberg version

1.6.1

### Query engine

Trino

### Please describe the bug 🐞

I'm running Iceberg on a compose setup with two concurrent writers:

1) one doing appends using PyIceberg
2) one doing a DELETE query using Trino, followed by an expire-snapshots run, also using Trino.

I'm using the following properties when creating the table:

```
table = self.catalog.create_table(
    identifier=...,
    schema=...,
    properties={
        "gc.enabled": True,
        "commit.retry.num-retries": 4,
        "write.delete.isolation-level": "snapshot",
        "write.update.isolation-level": "snapshot",
        "write.merge.isolation-level": "snapshot",
    },
)
```

I'm also using a plain JDBC catalog; for instance, the Trino connector config:

```
connector.name=iceberg
iceberg.catalog.type=jdbc
iceberg.jdbc-catalog.catalog-name=sql
iceberg.jdbc-catalog.driver-class=org.postgresql.Driver
iceberg.jdbc-catalog.connection-url=jdbc:postgresql://postgres:5432/catalog
iceberg.jdbc-catalog.connection-user=postgres
iceberg.jdbc-catalog.connection-password=postgres
iceberg.jdbc-catalog.default-warehouse-dir=s3://demobucket
fs.native-s3.enabled=true
s3.endpoint=http\://minio\:9000/
s3.path-style-access=true
s3.region=us-east-1
s3.aws-access-key=minioadmin
s3.aws-secret-key=minioadmin
iceberg.expire-snapshots.min-retention=2h
iceberg.remove-orphan-files.min-retention=1h
```

PyIceberg is committing new data (FastAppend) every few seconds; for instance:

```
{
  "snapshot-id": 2401014885715513300,
  "parent-snapshot-id": 5514772428877076000,
  "sequence-number": 3783,
  "timestamp-ms": 1733304538706,
  "manifest-list": "s3://demobucket/ts.db/pack/metadata/snap-2401014885715513233-0-2609e33f-b31d-4425-bcc6-bd074de3012f.avro",
  "summary": {
    "operation": "append",
    "added-files-size": "18618432",
    "added-data-files": "1",
    "added-records": "107768",
    "changed-partition-count": "1",
    "total-data-files": "3761",
    "total-delete-files": "3399",
    "total-records": "438076882",
    "total-files-size": "72715662134",
    "total-position-deletes": "393434960",
    "total-equality-deletes":
"0" }, "schema-id": 0 }, { "snapshot-id": 5605442414867702000, "parent-snapshot-id": 2401014885715513300, "sequence-number": 3784, "timestamp-ms": 1733304550871, "manifest-list": "s3://demobucket/ts.db/pack/metadata/snap-5605442414867701673-0-c0eef6ae-41b7-45ee-a777-299279b632ea.avro", "summary": { "operation": "append", ``` To cleanup older data we run following Query: ``` DELETE FROM pack WHERE epoch_timestamp_tz <= timestamp '2024-12-?? ??:??' AND timestampns < n ``` This correctly creates a delete commit; for instance: ``` { "snapshot-id": 3134131428617513500, "parent-snapshot-id": 2181664209442251000, "sequence-number": 3631, "timestamp-ms": 1733302382646, "manifest-list": "s3://demobucket/ts.db/pack/metadata/snap-3134131428617513346-2-43520ef4-077e-4b06-a8db-85c8f2c12e43.avro", "summary": { "operation": "delete", "trino_query_id": "20241204_084910_00399_5fm9y", "added-position-delete-files": "177", "added-delete-files": "177", "added-files-size": "27066930", "added-position-deletes": "20433177", "changed-partition-count": "1", "total-records": "420753387", "total-files-size": "69857546227", "total-data-files": "3611", "total-delete-files": "3219", "total-position-deletes": "372374521", "total-equality-deletes": "0", "iceberg-version": "Apache Iceberg 1.6.1 (commit 8e9d59d299be42b0bca9461457cd1e95dbaad086)" }, "schema-id": 0 }, ``` After the delete query we run expire-snapshots to cleanup old snapshots AND old datafiles that were removed by delete-operations earlier; for instance: ``` ALTER TABLE pack EXECUTE expire_snapshots(retention_threshold => '3h') ``` From the Trino logging I can see snapshots get expired and also delete-operation (snapshots) are expired / removed BUT none of the actual data files are removed? What are we missing here? 
```
org.apache.iceberg.RemoveSnapshots Expiring snapshots older than: 2024-12-04T03:59:01.303+00:00 (1733284741303)
org.apache.iceberg.RemoveSnapshots Committed snapshot changes
org.apache.iceberg.RemoveSnapshots Cleaning up expired files (local, incremental)
org.apache.iceberg.IncrementalFileCleanup Expired snapshot: BaseSnapshot{id=4579799571894291545, timestamp_ms=1733284090666, operation=append, summary={added-files-size=23432710, added-data-files=1, added-records=131666, changed-partition-count=1, total-data-files=2134, total-delete-files=1608, total-records=248302696, total-files-size=41244442373, total-position-deletes=185726747, total-equality-deletes=0}, manifest-list=s3://demobucket/ts.db/pack/metadata/snap-4579799571894291545-0-87990d26-3c22-4df0-a590-76c2805d95f1.avro, schema-id=0}
org.apache.iceberg.IncrementalFileCleanup Expired snapshot: BaseSnapshot{id=4640762571463645975, timestamp_ms=1733284113116, operation=append, summary={added-files-size=17668611, added-data-files=1, added-records=116928, changed-partition-count=1, total-data-files=2135, total-delete-files=1608, total-records=248419624, total-files-size=41262110984, total-position-deletes=185726747, total-equality-deletes=0}, manifest-list=s3://demobucket/ts.db/pack/metadata/snap-4640762571463645975-0-4b118704-b931-42e7-b7b3-3453766c746e.avro, schema-id=0}
<...>
org.apache.iceberg.IncrementalFileCleanup Expired snapshot: BaseSnapshot{id=9027409204264082391, timestamp_ms=1733285284267, operation=delete, summary={trino_query_id=20241204_040721_00185_5fm9y, added-position-delete-files=183, added-delete-files=183, added-files-size=27888755, added-position-deletes=21059718, changed-partition-count=2, total-records=260746407, total-files-size=43279826772, total-data-files=2242, total-delete-files=1791, total-position-deletes=206786465, total-equality-deletes=0, iceberg-version=Apache Iceberg 1.6.1 (commit 8e9d59d299be42b0bca9461457cd1e95dbaad086)},
manifest-list=s3://demobucket/ts.db/pack/metadata/snap-9027409204264082391-2-ad111312-2980-4071-9f7d-b819dfc1ed21.avro, schema-id=0}
```

I can only assume the old data files are still referenced by manifests? But how can we investigate this? What are we missing?

The relevant source code seems to be: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java#L261C17-L261C30

Paul

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org