bk-mz opened a new issue, #9833:
URL: https://github.com/apache/iceberg/issues/9833

   ### Apache Iceberg version
   
   1.4.3 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Hey folks, we're using `rewrite_position_delete_files` to compact position delete files.
   
   Each run rewrites the delete files but does not actually compact them: it writes roughly the same amount of data back into roughly the same number of files.
   
   ```
   CALL glue.system.rewrite_position_delete_files(
     table => 'table_name',
     where => 'data_load_ts < current_timestamp() - INTERVAL 1 HOURS',
     options => map(
       'partial-progress.enabled', 'true',
       'rewrite-all', 'true',
       'max-concurrent-file-group-rewrites', '50'))
   +----------------------------+------------------------+---------------------+-----------------+
   |rewritten_delete_files_count|added_delete_files_count|rewritten_bytes_count|added_bytes_count|
   +----------------------------+------------------------+---------------------+-----------------+
   |5474                        |5232                    |83456097             |82859000         |
   +----------------------------+------------------------+---------------------+-----------------+

   CALL glue.system.rewrite_position_delete_files(
     table => 'table_name',
     where => 'data_load_ts < current_timestamp() - INTERVAL 1 HOURS',
     options => map(
       'partial-progress.enabled', 'true',
       'rewrite-all', 'true',
       'max-concurrent-file-group-rewrites', '50'))
   +----------------------------+------------------------+---------------------+-----------------+
   |rewritten_delete_files_count|added_delete_files_count|rewritten_bytes_count|added_bytes_count|
   +----------------------------+------------------------+---------------------+-----------------+
   |5431                        |5265                    |83739802             |83200333         |
   +----------------------------+------------------------+---------------------+-----------------+

   CALL glue.system.rewrite_position_delete_files(
     table => 'table_name',
     where => 'data_load_ts < current_timestamp() - INTERVAL 1 HOURS',
     options => map(
       'partial-progress.enabled', 'true',
       'rewrite-all', 'true',
       'max-concurrent-file-group-rewrites', '50'))
   +----------------------------+------------------------+---------------------+-----------------+
   |rewritten_delete_files_count|added_delete_files_count|rewritten_bytes_count|added_bytes_count|
   +----------------------------+------------------------+---------------------+-----------------+
   |5443                        |5244                    |83643303             |83241939         |
   +----------------------------+------------------------+---------------------+-----------------+
   ```
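
To put numbers on "does not compact anything", here is a small sketch that derives the file-count ratio and average file size from the three results above. The figures are copied from the procedure output; the `summarize` helper and the dictionary keys are made up for illustration.

```python
# Figures copied from the three procedure results above.
# Each tuple: (rewritten_files, added_files, rewritten_bytes, added_bytes)
runs = [
    (5474, 5232, 83456097, 82859000),
    (5431, 5265, 83739802, 83200333),
    (5443, 5244, 83643303, 83241939),
]


def summarize(run):
    """Hypothetical helper: ratio of files kept and average sizes in KiB."""
    rewritten_files, added_files, rewritten_bytes, added_bytes = run
    return {
        "file_ratio": added_files / rewritten_files,
        "avg_rewritten_kib": rewritten_bytes / rewritten_files / 1024,
        "avg_added_kib": added_bytes / added_files / 1024,
    }


for run in runs:
    s = summarize(run)
    print(
        f"files kept: {s['file_ratio']:.1%}, "
        f"avg size before: {s['avg_rewritten_kib']:.1f} KiB, "
        f"after: {s['avg_added_kib']:.1f} KiB"
    )
```

In all three runs more than 95% of the file count survives the rewrite, and the average delete file stays at roughly 15 KiB before and after, so the rewrite is effectively a no-op in terms of compaction.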
   
   In fact, I think it has created odd partitions that contain only small delete files. I suspect the job keeps rewriting those small files over and over, ending up with the same small files each time.
   
   Normal partition on s3: `data_load_ts_hour=2024-02-29-06/`
   Odd partition: `data_load_ts_hour=474425/`
   
   There are a lot of those odd partitions. Their names are integers that increase incrementally from `474425` to `474754`.
   I think each run creates a new odd partition.
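
One way to sanity-check what those integers are: the Iceberg spec defines the `hour` transform result as the number of hours since 1970-01-01 00:00 UTC, so the odd directory names may simply be that raw ordinal instead of its human-readable form. That this is what happened here is an assumption, not something confirmed for this table. The helper name below is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Iceberg's hour transform counts hours from the Unix epoch (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_ordinal_to_partition_name(ordinal: int) -> str:
    """Hypothetical helper: render an hour ordinal as yyyy-MM-dd-HH."""
    return (EPOCH + timedelta(hours=ordinal)).strftime("%Y-%m-%d-%H")


# The first and last odd partition values mentioned above.
for ordinal in (474425, 474754):
    print(ordinal, "->", hour_ordinal_to_partition_name(ordinal))
```

Under that assumption, `474425` corresponds to `2024-02-14-17` and `474754` to `2024-02-28-10`, which would mean each odd partition is an ordinary hourly partition written under a differently formatted path rather than a new partition created per run.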
   
   <img width="1121" alt="image" src="https://github.com/apache/iceberg/assets/892781/ac54600d-1ad8-4ee3-b17a-733dffbaaef5">
   
   The odd partition contains only delete Parquet files:
   <img width="1156" alt="image" src="https://github.com/apache/iceberg/assets/892781/db1a2489-002b-4f5e-a86b-df7fcb90b2e5">
   
   Can you check and confirm whether this is an issue? For now we have disabled `rewrite_position_delete_files` entirely because the behavior is so odd.
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

