toien opened a new issue, #10907:
URL: https://github.com/apache/iceberg/issues/10907

   ### Query engine
   
   Spark SQL on AWS EMR(7.1.0)
   
   Versions: 
   - Spark: 3.5.0
   - Iceberg: 1.4.3
   - Flink: 1.18 (Managed Apache Flink of AWS)
   
   ### Question
   
   First, I created an Iceberg table:
   
   ```sql
   spark-sql (test_db)> show create table my_catalog.test_db.dws_table;
   CREATE TABLE my_catalog.test_db.dws_table (
     dt STRING NOT NULL,
     brand_code STRING NOT NULL,
     event_type STRING NOT NULL,
     sub_event_type STRING NOT NULL,
     success_count INT,
     failed_count INT)
   USING iceberg
   LOCATION 's3://xxx/test/test_db.db/dws_table'
   TBLPROPERTIES (
     'current-snapshot-id' = '3745013875610091505',
     'format' = 'iceberg/parquet',
     'format-version' = '2',
     'identifier-fields' = '[dt,brand_code,sub_event_type,event_type]',
     'write.metadata.delete-after-commit.enabled' = 'true',
     'write.metadata.previous-versions-max' = '5',
     'write.parquet.compression-codec' = 'zstd',
     'write.upsert.enabled' = 'true')
   ```
   
   Flink streaming jobs compute results and upsert them into this table, so many snapshots are created by Flink checkpoints:
   
   ```sql
   spark-sql (test_db)> select COUNT(*) from my_catalog.test_db.dws_table.snapshots;
   2130
   ```
   
   Here is the problem: when I run `expire_snapshots` via Spark SQL, the job **does take time** to execute:
   ```sql
   spark-sql (test_db)> CALL my_catalog.system.expire_snapshots(
                      >   table => 'test_db.dws_table',
                      >   retain_last => 5
                      > );
   deleted_data_files_count  deleted_position_delete_files_count  deleted_equality_delete_files_count  deleted_manifest_files_count  deleted_manifest_lists_count  deleted_statistics_files_count
   0       0       0       0       0       0
   Time taken: 45.336 seconds, Fetched 1 row(s)
   ```
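   For reference, `expire_snapshots` also accepts an `older_than` timestamp; snapshots newer than it (default: 5 days ago, from `history.expire.max-snapshot-age-ms`) are kept regardless of `retain_last`. A hedged sketch of the same call with an explicit, purely illustrative timestamp:
   
   ```sql
   -- Sketch only: the timestamp value below is illustrative.
   -- Snapshots newer than `older_than` are never expired, so on a table
   -- written continuously for less than 5 days the default may expire nothing.
   CALL my_catalog.system.expire_snapshots(
     table => 'test_db.dws_table',
     older_than => TIMESTAMP '2024-08-01 00:00:00',
     retain_last => 5
   );
   ```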
   
   But nothing was deleted!
   
   ```sql
   spark-sql (test_db)> select COUNT(*) from my_catalog.test_db.dws_table.snapshots;
   2164
   ```
   
   And the data files on S3 are still there.
   
   The Spark job finished successfully:
   
![iceberg-maintenance-failed](https://github.com/user-attachments/assets/6249e24b-52c0-4083-97a4-07248157e835)
   
   The same problem occurs when calling `rewrite_data_files` **too**: small data files are **not** compacted (merged).
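
   For completeness, this is the kind of `rewrite_data_files` call I mean; a hedged sketch with illustrative option values (the option map lowers the default input-file threshold so small files qualify for compaction):

   ```sql
   -- Sketch only: option values below are illustrative, not a known fix.
   CALL my_catalog.system.rewrite_data_files(
     table => 'test_db.dws_table',
     options => map(
       'min-input-files', '2',                  -- compact even 2 small files
       'target-file-size-bytes', '134217728'    -- 128 MB target
     )
   );
   ```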


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

