reele commented on issue #16976:
URL:
https://github.com/apache/dolphinscheduler/issues/16976#issuecomment-2608725571
@ruanwenjun
I tested this with a task configured to retry 5 minutes after a failure: I ran
the workflow, the task failed and started waiting for its retry, and then I
stopped the workflow. The workflow did not actually stop until 5 minutes later.
```log
...
[WI-0][TI-0] - 2025-01-22 14:55:19.537 INFO
[MasterRpcServer-methodInvoker-5] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish
event: TaskRunningLifecycleEvent{task=<Task-with-retry>, runtimeContext=null}
[WI-3954361][TI-0] - 2025-01-22 14:55:19.641 INFO
[ds-workflow-eventbus-worker-11]
o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task
<Task-with-retry> TaskRunningLifecycleEvent{task=<Task-with-retry>,
runtimeContext=null} with state RUNNING_EXECUTION
[WI-0][TI-0] - 2025-01-22 14:55:20.400 INFO
[MasterRpcServer-methodInvoker-12] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish
event: TaskFailedLifecycleEvent{task=<Task-with-retry>, endTime=Wed Jan 22
14:55:20 GMT+08:00 2025}
[WI-3954361][TI-0] - 2025-01-22 14:55:20.445 INFO
[ds-workflow-eventbus-worker-10] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish
event: TaskRetryLifecycleEvent{task=<Task-with-retry>, delayTime=300096/ms}
[WI-3954361][TI-0] - 2025-01-22 14:55:20.447 INFO
[ds-workflow-eventbus-worker-10]
o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task
<Task-with-retry> TaskFailedLifecycleEvent{task=<Task-with-retry>, endTime=Wed
Jan 22 14:55:20 GMT+08:00 2025} with state RUNNING_EXECUTION
[WI-0][TI-0] - 2025-01-22 14:55:34.205 INFO
[MasterRpcServer-methodInvoker-27] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish
event:
WorkflowStopLifecycleEvent{workflow=<Workflow-with-retry-task>-20250122145518737}
@@@@#### WorkflowStopLifecycleEvent was blocked here for 5 minutes ####@@@@
[WI-3954361][TI-0] - 2025-01-22 15:00:20.577 INFO
[ds-workflow-eventbus-worker-20] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish
event: TaskStartLifecycleEvent{task=<Task-with-retry>}
[WI-3954361][TI-0] - 2025-01-22 15:00:20.578 INFO
[ds-workflow-eventbus-worker-20]
o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task
<Task-with-retry> TaskRetryLifecycleEvent{task=<Task-with-retry>,
delayTime=300096/ms} with state FAILURE
[WI-3954361][TI-0] - 2025-01-22 15:00:20.579 INFO
[ds-workflow-eventbus-worker-20]
o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[47] - Begin fire
workflow <Workflow-with-retry-task>-20250122145518737
LifecycleEvent[WorkflowStopLifecycleEvent{workflow=<Workflow-with-retry-task>-20250122145518737}]
with state: RUNNING_EXECUTION
[WI-3954361][TI-0] - 2025-01-22 15:00:20.582 INFO
[ds-workflow-eventbus-worker-20]
o.a.d.s.m.e.w.s.AbstractWorkflowStateAction:[150] - Success set
WorkflowExecuteRunnable: <Workflow-with-retry-task>-20250122145518737 state
from: RUNNING_EXECUTION to READY_STOP
...
```
I just found the main reason, it is here:
https://github.com/apache/dolphinscheduler/blob/352b47bd8576a47f83285ecfffec589de462fac0/dolphinscheduler-eventbus/src/main/java/org/apache/dolphinscheduler/eventbus/AbstractDelayEvent.java#L62-L64
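AbstractDelayEvent uses createTimeInNano when comparing itself to other events,
so DelayQueue orders the events by createTimeInNano. The retry event was put in
the queue first, so it sits at the head of the queue, and DelayQueue only hands
out the head once the head's delay has expired. Any event queued after it,
including the WorkflowStopLifecycleEvent, has to wait for the retry delay.
If I change the compared value from `createTimeInNano` to `createTimeInNano +
delayTime`, the WorkflowStopLifecycleEvent is not blocked any more.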
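
To make the ordering problem easier to see, here is a minimal standalone sketch using only `java.util.concurrent.DelayQueue`. It is not the DolphinScheduler code: the `Event` class, the `orderByExpiry` flag and the `pollFirst` helper are hypothetical names made up for this illustration, and only `createTimeInNano` and the delay value mirror what is described above.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class DelayOrderingSketch {

    // Hypothetical stand-in for a delay event; not the real AbstractDelayEvent.
    static class Event implements Delayed {
        final String name;
        final long createTimeInNano;
        final long delayNanos;
        final boolean orderByExpiry; // false: compare by create time, true: by create time + delay

        Event(String name, long createTimeInNano, long delayMillis, boolean orderByExpiry) {
            this.name = name;
            this.createTimeInNano = createTimeInNano;
            this.delayNanos = TimeUnit.MILLISECONDS.toNanos(delayMillis);
            this.orderByExpiry = orderByExpiry;
        }

        long expireTimeInNano() {
            return createTimeInNano + delayNanos;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // Remaining time until the event may be taken from the queue.
            return unit.convert(expireTimeInNano() - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            Event o = (Event) other;
            long mine = orderByExpiry ? expireTimeInNano() : createTimeInNano;
            long theirs = o.orderByExpiry ? o.expireTimeInNano() : o.createTimeInNano;
            // Compare via the difference, as recommended for System.nanoTime() values.
            return Long.signum(mine - theirs);
        }
    }

    // Queues a 5-minute retry event first and a zero-delay stop event second,
    // then returns the name of the event that poll() hands out right away (or null).
    static String pollFirst(boolean orderByExpiry) {
        long now = System.nanoTime();
        DelayQueue<Event> queue = new DelayQueue<>();
        queue.add(new Event("retry", now - 1_000, 300_000, orderByExpiry));
        queue.add(new Event("stop", now, 0, orderByExpiry));
        Event head = queue.poll(); // non-blocking: null if the head has not expired yet
        return head == null ? null : head.name;
    }

    public static void main(String[] args) {
        // Ordered by create time: the retry event is the head, so nothing is available.
        System.out.println("compare by createTimeInNano: " + pollFirst(false));
        // Ordered by expiry time: the stop event is the head and is available immediately.
        System.out.println("compare by createTimeInNano + delay: " + pollFirst(true));
    }
}
```

With the create-time ordering, `poll()` returns `null` even though the stop event has no delay, because the unexpired retry event is the head of the queue. With the expiry-time ordering, the stop event comes out immediately, which matches the behaviour I saw after changing the compared value.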