[I] Flink Rewrite Files Action OOM [iceberg]

via GitHub Fri, 01 Dec 2023 03:40:42 -0800


bhupixb opened a new issue, #9193:
URL: https://github.com/apache/iceberg/issues/9193


   ### Apache Iceberg version
   
   1.4.1
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   ##### Background:
   We are using the Flink Iceberg sink to write data to an Iceberg table (Hive 
catalog, S3 storage) in _**table format v2 with upsert enabled**_. 
   We have written ~25 million records to the table (partitioned daily), and the 
table sort order is on 3 fields (all strings, max length 32 characters).
   
   Table Schema:
   ```java
   public class SideOutput {
     private String input_record;
     private String operator;
     private String error_message;
     private String error_code;
     private String job_id;
     private String tenant_id;
     private Integer unique_id;
     private String status;
     private Long processed_at;
     private String job_creation_date;
     private String record_id;
     private String parent_operator;
   }
   ```
   
   We are only in the POC phase, and the data is written successfully (only 25M 
records in a single daily partition, no other data). The checkpoint interval is 
60s, and ~350 files have been written to S3. File sizes range from ~10 KB to 
~3 MB.
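   
   For scale, a rough upper bound on the data volume follows from the numbers 
above. The sketch below only does that arithmetic; the class and method names 
are hypothetical, and it assumes 1 MB = 1,000,000 bytes.

```java
// Hypothetical helper: back-of-the-envelope bound on the table's data size,
// using only the file count and maximum file size quoted above.
class DataVolumeEstimate {
    // Assumption: decimal units (1 MB = 1,000,000 bytes).
    static final long BYTES_PER_MB = 1_000_000L;

    /** Upper bound in bytes if every file were as large as the largest one. */
    static long upperBoundBytes(int fileCount, long maxFileSizeBytes) {
        return fileCount * maxFileSizeBytes;
    }

    public static void main(String[] args) {
        long bound = upperBoundBytes(350, 3 * BYTES_PER_MB);
        System.out.println("At most ~" + bound / BYTES_PER_MB + " MB of files");
    }
}
```

   That puts the ~350 files at roughly 1 GB on disk at most.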
   
   #### Issue:
   When we try to read data from this table using Trino, the Trino workers 
(4 workers, 20 GB each) use up all of their memory, major GC triggers 
frequently, and eventually they die.
   At first, I thought we might have a problem with our Trino infrastructure, 
so I wrote a Flink job to read this data with a simple query 
(`select a, b, c, count(1) from table group by a, b, c;`, which returns at most 
5-6 rows). Surprisingly, we hit the same **heap OOM issue** in the TM.
   
   Then I came across this doc, 
https://iceberg.apache.org/docs/latest/flink-actions/, which describes rewriting 
small files into larger ones. So I wrote another Flink job with parallelism 4 
and a TM (16 GB memory, 8 CPU cores) to perform this rewrite action. The job 
runs for some time and eventually the TM dies. I can see TM CPU usage go to 
100% and heap usage to ~95%.
   We took a heap dump of the worker and saw that the memory is occupied by 
HashMap:
   
   <img width="906" alt="image" 
src="https://github.com/apache/iceberg/assets/30554307/9394a4f8-5b3e-470d-8f88-b88e5f618c6e">
   
   I went through a number of GitHub issues such as 
https://github.com/apache/iceberg/issues/6104 but was not able to figure out 
what is causing the problem.
   Flink Version: 1.16.0
   Iceberg version: 1.4.1
   
   We tried both batch and streaming mode for the rewrite action job, but 
without success.
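   
   For readers unfamiliar with the rewrite action mentioned above, its goal is 
to combine many small files into fewer, larger ones. The following is a 
conceptual sketch of that idea only, using greedy grouping up to a target size. 
It is not Iceberg's actual planner, and the class name, file sizes, and 4 MB 
target are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: greedily group small files until a target size is
// reached. This is NOT Iceberg's planner; names and sizes are hypothetical.
class CompactionSketch {
    /**
     * Groups file sizes (bytes), in order. Each group stays within targetBytes
     * unless a single file is itself larger than the target.
     */
    static List<List<Long>> group(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Start a new group when adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sizes in the 10 KB to 3 MB range described above.
        List<Long> sizes =
            List.of(10_000L, 3_000_000L, 1_500_000L, 2_000_000L, 500_000L);
        // Hypothetical 4 MB target, chosen only to keep the example small.
        List<List<Long>> groups = group(sizes, 4_000_000L);
        System.out.println(
            groups.size() + " output groups from " + sizes.size() + " input files");
    }
}
```

   Each resulting group would correspond to one rewritten output file in this 
simplified picture.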


