paul-bormans-pcgw opened a new issue, #11695:
URL: https://github.com/apache/iceberg/issues/11695

   ### Apache Iceberg version
   
   1.6.1
   
   ### Query engine
   
   Trino
   
   ### Please describe the bug 🐞
   
   I'm running Iceberg in a Docker Compose setup and have two concurrent writers:
   1) one doing appends using PyIceberg
   2) one running a DELETE query using Trino, followed by an expire-snapshots, also using Trino.
   
   I'm using the following properties when creating the table:
   ```
               table = self.catalog.create_table(
                   identifier=...,
                   schema=...,
                   properties={
                       "gc.enabled": True,
                       "commit.retry.num-retries": 4, 
                       "write.delete.isolation-level": "snapshot", 
                       "write.update.isolation-level": "snapshot",
                       "write.merge.isolation-level": "snapshot",
                   },
                )
   ```
   
   Also, I'm using a plain JDBC catalog; for instance, the Trino connector config:
   ```
   connector.name=iceberg
   iceberg.catalog.type=jdbc
   iceberg.jdbc-catalog.catalog-name=sql
   iceberg.jdbc-catalog.driver-class=org.postgresql.Driver
   iceberg.jdbc-catalog.connection-url=jdbc:postgresql://postgres:5432/catalog
   iceberg.jdbc-catalog.connection-user=postgres
   iceberg.jdbc-catalog.connection-password=postgres
   iceberg.jdbc-catalog.default-warehouse-dir=s3://demobucket
   fs.native-s3.enabled=true
   s3.endpoint=http\://minio\:9000/
   s3.path-style-access=true
   s3.region=us-east-1
   s3.aws-access-key=minioadmin
   s3.aws-secret-key=minioadmin
   iceberg.expire-snapshots.min-retention=2h
   iceberg.remove-orphan-files.min-retention=1h
   ```
   
   PyIceberg commits new data (FastAppend) every few seconds; for instance:
   ```
         {
           "snapshot-id": 2401014885715513300,
           "parent-snapshot-id": 5514772428877076000,
           "sequence-number": 3783,
           "timestamp-ms": 1733304538706,
           "manifest-list": 
"s3://demobucket/ts.db/pack/metadata/snap-2401014885715513233-0-2609e33f-b31d-4425-bcc6-bd074de3012f.avro",
           "summary": {
             "operation": "append",
             "added-files-size": "18618432",
             "added-data-files": "1",
             "added-records": "107768",
             "changed-partition-count": "1",
             "total-data-files": "3761",
             "total-delete-files": "3399",
             "total-records": "438076882",
             "total-files-size": "72715662134",
             "total-position-deletes": "393434960",
             "total-equality-deletes": "0"
           },
           "schema-id": 0
         },
         {
           "snapshot-id": 5605442414867702000,
           "parent-snapshot-id": 2401014885715513300,
           "sequence-number": 3784,
           "timestamp-ms": 1733304550871,
           "manifest-list": 
"s3://demobucket/ts.db/pack/metadata/snap-5605442414867701673-0-c0eef6ae-41b7-45ee-a777-299279b632ea.avro",
           "summary": {
             "operation": "append",
   ```
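
   The writer side boils down to a plain PyIceberg append per batch; a minimal sketch (the Arrow batch construction and the `write_batch` helper are placeholders, not our actual code):
   ```
   import pyarrow as pa
   from pyiceberg.table import Table

   def write_batch(table: Table, batch: pa.Table) -> None:
       # Each call commits one FastAppend snapshot referencing the new data file(s),
       # producing append entries like the two snapshots shown above.
       table.append(batch)
   ```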
   
   To clean up older data we run the following query:
   ```
   DELETE FROM pack WHERE epoch_timestamp_tz <= timestamp '2024-12-?? ??:??' AND timestampns < n
   ```
   
   This correctly creates a delete commit; for instance:
   ```
         {
           "snapshot-id": 3134131428617513500,
           "parent-snapshot-id": 2181664209442251000,
           "sequence-number": 3631,
           "timestamp-ms": 1733302382646,
           "manifest-list": 
"s3://demobucket/ts.db/pack/metadata/snap-3134131428617513346-2-43520ef4-077e-4b06-a8db-85c8f2c12e43.avro",
           "summary": {
             "operation": "delete",
             "trino_query_id": "20241204_084910_00399_5fm9y",
             "added-position-delete-files": "177",
             "added-delete-files": "177",
             "added-files-size": "27066930",
             "added-position-deletes": "20433177",
             "changed-partition-count": "1",
             "total-records": "420753387",
             "total-files-size": "69857546227",
             "total-data-files": "3611",
             "total-delete-files": "3219",
             "total-position-deletes": "372374521",
             "total-equality-deletes": "0",
             "iceberg-version": "Apache Iceberg 1.6.1 (commit 
8e9d59d299be42b0bca9461457cd1e95dbaad086)"
           },
           "schema-id": 0
         },
   ```
   
   After the delete query we run expire-snapshots to clean up old snapshots AND the old data files that were removed by the earlier delete operations; for instance:
   ```
   ALTER TABLE pack EXECUTE expire_snapshots(retention_threshold => '3h')
   ```
   
   From the Trino logging I can see that snapshots get expired, and the delete-operation snapshots are expired/removed as well, BUT none of the actual data files are removed. What are we missing here?
   ```
   org.apache.iceberg.RemoveSnapshots   Expiring snapshots older than: 2024-12-04T03:59:01.303+00:00 (1733284741303)
   org.apache.iceberg.RemoveSnapshots   Committed snapshot changes
   org.apache.iceberg.RemoveSnapshots   Cleaning up expired files (local, incremental)
   org.apache.iceberg.IncrementalFileCleanup    Expired snapshot: BaseSnapshot{id=4579799571894291545, timestamp_ms=1733284090666, operation=append, summary={added-files-size=23432710, added-data-files=1, added-records=131666, changed-partition-count=1, total-data-files=2134, total-delete-files=1608, total-records=248302696, total-files-size=41244442373, total-position-deletes=185726747, total-equality-deletes=0}, manifest-list=s3://demobucket/ts.db/pack/metadata/snap-4579799571894291545-0-87990d26-3c22-4df0-a590-76c2805d95f1.avro, schema-id=0}
   org.apache.iceberg.IncrementalFileCleanup    Expired snapshot: BaseSnapshot{id=4640762571463645975, timestamp_ms=1733284113116, operation=append, summary={added-files-size=17668611, added-data-files=1, added-records=116928, changed-partition-count=1, total-data-files=2135, total-delete-files=1608, total-records=248419624, total-files-size=41262110984, total-position-deletes=185726747, total-equality-deletes=0}, manifest-list=s3://demobucket/ts.db/pack/metadata/snap-4640762571463645975-0-4b118704-b931-42e7-b7b3-3453766c746e.avro, schema-id=0}
   <...>
   org.apache.iceberg.IncrementalFileCleanup    Expired snapshot: BaseSnapshot{id=9027409204264082391, timestamp_ms=1733285284267, operation=delete, summary={trino_query_id=20241204_040721_00185_5fm9y, added-position-delete-files=183, added-delete-files=183, added-files-size=27888755, added-position-deletes=21059718, changed-partition-count=2, total-records=260746407, total-files-size=43279826772, total-data-files=2242, total-delete-files=1791, total-position-deletes=206786465, total-equality-deletes=0, iceberg-version=Apache Iceberg 1.6.1 (commit 8e9d59d299be42b0bca9461457cd1e95dbaad086)}, manifest-list=s3://demobucket/ts.db/pack/metadata/snap-9027409204264082391-2-ad111312-2980-4071-9f7d-b819dfc1ed21.avro, schema-id=0}
   ```
   
   I can only assume the old data files are still referenced by manifests. But how can we investigate this? What are we missing?
   
   The relevant source code seems to be: 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java#L261C17-L261C30
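
   One way we could check what is still referenced is to dump the files metadata table from PyIceberg; a sketch, assuming a recent PyIceberg release that exposes `table.inspect` and using the table identifier inferred from the metadata paths above:
   ```
   # Reuses the `catalog` configured earlier; "ts.pack" is inferred from the
   # s3://demobucket/ts.db/pack/ metadata paths shown above.
   table = catalog.load_table("ts.pack")

   # Snapshots that survived expiration.
   print(table.inspect.snapshots().to_pylist())

   # Data and delete files referenced by the current snapshot.
   for f in table.inspect.files().to_pylist():
       print(f["content"], f["file_path"], f["record_count"])
   ```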
   
   Paul
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time

