[
https://issues.apache.org/jira/browse/FLINK-37399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-37399:
-----------------------------------
Labels: pull-request-available (was: )
> Watermark alignment can prevent backlogged jobs from using all available
> resources
> ----------------------------------------------------------------------------------
>
> Key: FLINK-37399
> URL: https://issues.apache.org/jira/browse/FLINK-37399
> Project: Flink
> Issue Type: Improvement
> Components: API / DataStream
> Affects Versions: 2.0.0, 1.18.1, 1.19.2
> Reporter: Piotr Nowojski
> Assignee: Piotr Nowojski
> Priority: Major
> Labels: pull-request-available
>
> Imagine the following scenario. Max allowed watermark alignment drift is set
> to 30s and watermark alignment is synced/announced across subtasks every 1s.
> This makes perfect sense during normal records processing.
> But when processing messages from backlog, it creates a problem, defacto
> capping how quickly watermarks can be progressing. For example:
> * at {{t0}} we announce that max allowed watermark is {{max_w0 =
> min(reported_watermark) + 30s}}
> * next announcement will happen at {{t1 = t0 + 1s}}
> * any source/partition that exceeds {{max_w0}} before the next announcement
> happens at {{t1}}, will be blocked by the watermark alignment
> In other words, with the above configuration (30s allowed drift announced
> every ~1s), we are de facto capping the backlog processing speed to:
> *30 "event time" seconds per each 1 "real world" second*
> The problem is the delay in how quickly watermark announcements are
> propagating between operators and the coordinator. By the time operator
> receives what is max allowed watermark, that value is already ~1s old.
> Symptoms of this problem are:
> * non empty backlog/pending records
> * job is under utilised, with all subtasks being at least partially idle (for
> example 50% idle)
> In other words, there are records to be processed from the backlog, job has
> resources to process more records more quickly, but it doesn't because
> watermark alignment is blocking the progress, constantly pausing/resuming all
> of the splits.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)