liaoxin01 commented on PR #64301:
URL: https://github.com/apache/doris/pull/64301#issuecomment-4725407818
Suggestion on the timeout path: `processTimeoutTasks` calls
`fetchProgress()` (a blocking brpc to cdc_client) on **every** tick, but the
progress only matters when the task already looks timed out. Better to do the
cheap local check first and only fetch to confirm:
```
if (!runningMultiTask.budgetExceeded()) { // now - lastProgressMs <= timeoutMs, no RPC
    return;
}
StreamingTaskProgress progress = runningMultiTask.fetchProgress(); // only at the boundary
// then under writeLock: isTimeout(progress) -> renew or kill
```
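To make the effect concrete, here is a minimal, self-contained sketch of the lazy check. It is not the Doris code: the class `LazyTimeoutCheck`, its fields, and the `LongSupplier` standing in for the blocking `fetchProgress()` RPC are all hypothetical, and "progress" is assumed to be a monotonically increasing offset.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a running streaming task; not the actual Doris class. */
class LazyTimeoutCheck {
    private final long timeoutMs;
    private final LongSupplier remoteProgress; // stands in for the blocking fetchProgress() RPC
    private long lastProgressMs;
    private long lastOffset = 0;
    int rpcCount = 0; // counts how often the "RPC" was actually issued

    LazyTimeoutCheck(long timeoutMs, long startMs, LongSupplier remoteProgress) {
        this.timeoutMs = timeoutMs;
        this.lastProgressMs = startMs;
        this.remoteProgress = remoteProgress;
    }

    /** Cheap local check, no RPC: true once the timeout window has elapsed. */
    boolean budgetExceeded(long nowMs) {
        return nowMs - lastProgressMs > timeoutMs;
    }

    /** One scheduler tick. Returns true if the task should be killed. */
    boolean tick(long nowMs) {
        if (!budgetExceeded(nowMs)) {
            return false; // still within budget: skip the RPC entirely
        }
        rpcCount++;
        long offset = remoteProgress.getAsLong(); // only at the boundary
        if (offset > lastOffset) {
            lastOffset = offset;
            lastProgressMs = nowMs; // progress was made: renew the window
            return false;
        }
        return true; // no progress for a full window: time out
    }
}
```

Driving this with a 10s tick and a 300s timeout for 600s, with a remote side that always reports progress, issues the simulated RPC once rather than on each of the 60 ticks.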
Why it matters: the running tick fires every `max_interval` (default
**10s**) on the shared `insert-task-execute` pool
(`job_insert_task_consumer_thread_num`, default **10**, shared by all
INSERT/streaming jobs), and each RPC blocks up to
`streaming_cdc_light_rpc_timeout_sec` (**90s**). With the current code every
running job does an unconditional `fetchProgress` every 10s, and
`detectTaskFailure` adds a second unconditional `getFailReason` RPC per tick.
If cdc_client degrades, a single tick can hold a pool thread for ~180s (two
RPCs at up to 90s each), and enough such jobs can starve the whole pool.
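The worst-case numbers can be checked with a few lines of arithmetic. The sketch below uses only the defaults quoted above; the class and method names are made up for illustration, and it assumes the two RPCs in a tick run back to back and both hit the full timeout.

```java
/** Back-of-the-envelope estimate; the names here are illustrative, not Doris APIs. */
class PoolStarvationEstimate {
    /** Worst-case seconds one tick holds a pool thread when every RPC runs to its timeout. */
    static int worstCaseHoldSec(int rpcsPerTick, int rpcTimeoutSec) {
        return rpcsPerTick * rpcTimeoutSec;
    }

    /** Number of further ticks that come due for one job while an earlier tick is still blocked. */
    static int ticksDueWhileBlocked(int holdSec, int tickIntervalSec) {
        return holdSec / tickIntervalSec;
    }

    public static void main(String[] args) {
        int hold = worstCaseHoldSec(2, 90); // fetchProgress + getFailReason, 90s each
        System.out.println("worst-case hold per tick: " + hold + "s");
        System.out.println("ticks due meanwhile: " + ticksDueWhileBlocked(hold, 10));
    }
}
```

With a pool of 10 threads, 10 jobs each stuck in one such tick are enough to occupy every thread, assuming at most one in-flight tick per job.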
With the lazy check, `fetchProgress` drops from once/10s to at most once per
timeout window (>=300s). Consider throttling the `detectTaskFailure`
getFailReason similarly (e.g. every N ticks, or piggyback the fail reason on
the existing `fetchMeta` response) so the two RPCs do not stack on that bounded
pool.
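For the "every N ticks" option, a minimal sketch could look like the following. `EveryNTicks` is a hypothetical helper, not an existing Doris class, and it assumes it is only called from the single thread that runs the job's tick (it is not thread-safe).

```java
/** Hypothetical throttle: lets an action through on every n-th tick only. */
class EveryNTicks {
    private final int n;
    private long ticks = 0;

    EveryNTicks(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Returns true on ticks n, 2n, 3n, ... and false on all others. */
    boolean shouldRun() {
        ticks++;
        return ticks % n == 0;
    }
}
```

A caller would guard the `getFailReason` RPC with `shouldRun()`. The trade-off is detection latency: with a 10s tick and N = 6, a failure may go unnoticed for up to about 60s, which is why piggybacking the fail reason on `fetchMeta` may be the better option if that response can carry it.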