zohar-plutoflume opened a new issue, #7379:
URL: https://github.com/apache/iceberg/issues/7379
### Apache Iceberg version
0.14.1
### Query engine
EMR
### Please describe the bug 🐞
We noticed that a delete command that executes successfully does not actually
delete the data.
An example query would be:
```
DELETE FROM table WHERE tenant_id = 690
```
We would expect this to delete everything for this tenant, but some records
are still left.
When we query the table after the delete:
```
select count(*) from table where tenant_id=690
```
it returns a count of 7.
Now for the details
(EMR 6.9.0, Iceberg 0.14.1, Spark 3.3.0):
I can't reproduce the issue locally, so unfortunately I can only show the
information I got from the logs while trying to debug it:
1. job correctly loads the table:
```
2023-04-19T12:32:12,561 INFO iceberg.BaseMetastoreTableOperations:
Refreshing table metadata from new version:
s3://prod-tessian-platform.com-data-lake/email_check_outbound_priority/metadata/32444-ecaf012a-6ff8-4485-a4a5-3343cbc46e00.metadata.json
```
2. job correctly understands that the column we delete from is a partition
column and the operation is a metadata operation only:
```
2023-04-19T12:32:15,732 INFO iceberg.BaseTableScan: Scanning table
iceberg.iceberg_db.email_check_outbound_priority snapshot 8530920662702686267
created at 2023-04-19 12:20:25.224 with filter tenant_id = (3-digit-int)
2023-04-19T12:32:17,625 INFO v2.OptimizeMetadataOnlyDeleteFromIcebergTable$:
Optimizing delete expression: EqualTo(tenant_id,690) as metadata delete
```
3. job correctly commits a new iceberg snapshot:
```
2023-04-19T12:32:21,859 INFO iceberg.BaseMetastoreTableOperations:
Successfully committed to table
iceberg.iceberg_db.email_check_outbound_priority in 456 ms
2023-04-19T12:32:21,859 INFO iceberg.SnapshotProducer: Committed snapshot
1441441847084407586 (StreamingDelete)
```
4. snapshot is found in the table:
```
2023-04-19 12:32:21.233|1441441847084407586|8530920662702686267|delete|s3://prod-tessian-platform.com-data-lake/email_check_outbound_priority/metadata/snap-1441441847084407586-1-e688a3b4-a062-4464-8b78-47c432cedd69.avro|
{
spark.app.id -> application_1681907352232_0001,
changed-partition-count -> 0,
total-records -> 28127739,
total-files-size -> 3851046875,
total-data-files -> 234706,
total-delete-files -> 0,
total-position-deletes -> 0,
total-equality-deletes -> 0
}
```
and yet the data is still there:
```
"SELECT COUNT (*) FROM iceberg.iceberg_db.email_check_outbound_priority
WHERE tenant_id = 690"
+--------+
|count(1)|
+--------+
|7 |
+--------+
```
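A minimal sketch of how the snapshot summary from step 4 could be checked for signs of a no-op delete. The helper names `parse_summary` and `delete_was_noop` are hypothetical. The keys `deleted-data-files` and `deleted-records` are standard Iceberg snapshot summary properties that do not appear in the summary above, and the sketch assumes that a missing key means nothing was removed.

```python
import re


def parse_summary(text: str) -> dict[str, str]:
    """Parse 'key -> value' pairs from a Spark-rendered summary map."""
    pairs = re.findall(r"([\w.\-]+)\s*->\s*([^,}\n]+)", text)
    return {key: value.strip() for key, value in pairs}


def delete_was_noop(summary: dict[str, str]) -> bool:
    """Return True when a delete snapshot reports no removed files or records.

    Assumption: an absent counter is treated as zero.
    """
    removed_files = int(summary.get("deleted-data-files", "0"))
    removed_records = int(summary.get("deleted-records", "0"))
    changed_partitions = int(summary.get("changed-partition-count", "0"))
    return removed_files == 0 and removed_records == 0 and changed_partitions == 0


# Summary copied from the snapshot row in step 4.
SUMMARY = """{
spark.app.id -> application_1681907352232_0001,
changed-partition-count -> 0,
total-records -> 28127739,
total-files-size -> 3851046875,
total-data-files -> 234706,
total-delete-files -> 0,
total-position-deletes -> 0,
total-equality-deletes -> 0
}"""

summary = parse_summary(SUMMARY)
print(delete_was_noop(summary))
```

With the summary above, the check flags the committed `delete` snapshot as having touched no partitions and removed no files, which matches the 7 rows still returned by the count query.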