njnu-seafish opened a new issue, #17436: URL: https://github.com/apache/dolphinscheduler/issues/17436
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

### 1. Create a shell task and configure the timeout failure strategy

<img width="1006" height="837" alt="Image" src="https://github.com/user-attachments/assets/d5b67942-6d3f-4687-8da8-3b541a268c47" />

### 2. Manually kill the task; the logs show that the kill succeeded. (**The cancelApplication method is called only once.**)

2025-08-15 13:49:33.105 INFO [WorkerRpcServer-methodInvoker-224] - Publish TaskExecutorKillLifecycleEvent: { "taskInstanceId" : 1081, "eventCreateTime" : 1755236973105, "type" : "KILL" }
2025-08-15 13:49:33.147 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Begin killing task instance, processId: 749659
2025-08-15 13:49:33.449 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - prepare to parse pid, raw pid string: sudo(749659)---1081.sh(749674)---sleep(749748)
2025-08-15 13:49:34.003 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Sending SIGINT to process group: 749659 749674 749748, command: sudo -u dolphinscheduler -i kill -s SIGINT 749659 749674 749748
2025-08-15 13:49:44.992 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 749659 749674 749748, timed out, still running PIDs: 749659 749674 749748
2025-08-15 13:49:45.545 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Sending SIGTERM to process group: 749659 749674 749748, command: sudo -u dolphinscheduler -i kill -s SIGTERM 749659 749674 749748
2025-08-15 13:49:46.253 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Successfully killed process tree using SIGTERM, processId: 749659
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Process tree for task: 1081 is killed or already finished, pid: 749659
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Get appIds from worker 192.168.30.121:1234, taskLogPath: /data01/dolphinscheduler/20250815/149143631011392/1/1015/1081.log
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Start finding appId in /data01/dolphinscheduler/20250815/149143631011392/1/1015/1081.log, fetch way: log
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - The appId is empty
2025-08-15 13:49:46.254 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-87] - Success fire TaskExecutorKillLifecycleEvent: { "taskInstanceId" : 1081, "eventCreateTime" : 1755236973105, "type" : "KILL" }
2025-08-15 13:49:46.360 INFO [exclusive-task-executor-container-worker-0] - process has exited. execute path:/data01/dolphinscheduler/exec/process/1081, processId:749659 ,exitStatusCode:143 ,processWaitForStatus:true ,processExitValue:143

### 3. However, an exception is thrown when the task is killed due to timeout.
(**The cancelApplication method was called twice.**)

2025-08-15 16:55:37.289 INFO [WorkerRpcServer-methodInvoker-31] - Publish TaskExecutorKillLifecycleEvent: { "taskInstanceId" : 1084, "eventCreateTime" : 1755248137289, "type" : "KILL" }
2025-08-15 16:55:37.333 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Begin killing task instance, processId: 837363
2025-08-15 16:55:37.730 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - prepare to parse pid, raw pid string: sudo(837363)---1084.sh(837379)---sleep(837453)
2025-08-15 16:55:38.316 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Sending SIGINT to process group: 837363 837379 837453, command: sudo -u dolphinscheduler -i kill -s SIGINT 837363 837379 837453
2025-08-15 16:55:49.325 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Kill command: sudo -u dolphinscheduler -i kill -s SIGINT 837363 837379 837453, timed out, still running PIDs: 837363 837379 837453
2025-08-15 16:55:49.876 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-0] - Sending SIGTERM to process group: 837363 837379 837453, command: sudo -u dolphinscheduler -i kill -s SIGTERM 837363 837379 837453
2025-08-15 16:55:50.166 ERROR [exclusive-task-executor-container-worker-0] - process has failure, the task timeout configuration value is:60, ready to kill ...
2025-08-15 16:55:50.167 INFO [exclusive-task-executor-container-worker-0] - Begin killing task instance, processId: 837363
2025-08-15 16:55:50.566 INFO [exclusive-task-executor-container-worker-0] - prepare to parse pid, raw pid string:
2025-08-15 16:55:50.567 ERROR [exclusive-task-executor-container-worker-0] - Kill task instance error, processId: 837363
java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:592)
    at java.lang.Integer.parseInt(Integer.java:615)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
    at org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils.kill(ProcessUtils.java:124)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.cancelApplication(AbstractCommandExecutor.java:216)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.run(AbstractCommandExecutor.java:196)
    at org.apache.dolphinscheduler.plugin.task.shell.ShellTask.handle(ShellTask.java:85)
    at org.apache.dolphinscheduler.server.worker.executor.PhysicalTaskExecutor.doTriggerTaskPlugin(PhysicalTaskExecutor.java:74)
    at org.apache.dolphinscheduler.task.executor.AbstractTaskExecutor.start(AbstractTaskExecutor.java:80)
    at org.apache.dolphinscheduler.task.executor.worker.TaskExecutorWorker.start(TaskExecutorWorker.java:65)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
2025-08-15 16:55:50.567 ERROR [exclusive-task-executor-container-worker-0] - Failed to kill process tree for task: 1084, pid: 837363
2025-08-15 16:55:50.567 INFO [exclusive-task-executor-container-worker-0] - Get appIds from worker 192.168.30.121:1234, taskLogPath: /data01/dolphinscheduler/20250815/149143631011392/1/1018/1084.log
2025-08-15 16:55:50.567 INFO [exclusive-task-executor-container-worker-0] - Start finding appId in /data01/dolphinscheduler/20250815/149143631011392/1/1018/1084.log, fetch way: log
2025-08-15 16:55:50.567 INFO [exclusive-task-executor-container-worker-0] - The appId is empty
2025-08-15 16:55:50.568 INFO [exclusive-task-executor-container-worker-0] - process has exited. execute path:/data01/dolphinscheduler/exec/process/1084, processId:837363 ,exitStatusCode:-1 ,processWaitForStatus:false ,processExitValue:143
2025-08-15 16:55:50.568 INFO [exclusive-task-executor-container-worker-0] - Publish TaskExecutorFailedLifecycleEvent: { "taskInstanceId" : 1084, "eventCreateTime" : 1755248150568, "type" : "FAILED", "workflowInstanceId" : 1018, "workflowInstanceHost" : "192.168.30.11:5678", "taskInstanceHost" : "192.168.30.121:1234", "appIds" : "", "endTime" : 1755248150568, "latestReportTime" : null }

### What you expected to happen

1. Killing a task on timeout should not throw an exception.
2. Ideally, the kill action should be triggered only once.

### How to reproduce

1. Create a shell task and configure the timeout failure strategy

<img width="1006" height="837" alt="Image" src="https://github.com/user-attachments/assets/d5b67942-6d3f-4687-8da8-3b541a268c47" />

2. Run the workflow and wait for the task to be killed after the timeout.

### Anything else

_No response_

### Version

dev

### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
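
To illustrate the two expectations stated in this issue (no exception on a timeout kill, and a single kill action), here is a minimal Java sketch. It is not the DolphinScheduler implementation: the class `KillGuardSketch` and the methods `parsePids` and `tryStartKill` are hypothetical names. It assumes the raw pid string has the form shown in the logs, e.g. `sudo(837363)---1084.sh(837379)---sleep(837453)`, and that the `NumberFormatException` occurs because the second kill attempt receives an empty pid string once the process tree is already gone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of DolphinScheduler.
final class KillGuardSketch {

    // Matches "(12345)" groups in a pid string such as
    // "sudo(749659)---1081.sh(749674)---sleep(749748)".
    private static final Pattern PID_IN_PARENS = Pattern.compile("\\((\\d+)\\)");

    // Set to true by the first caller that starts a kill.
    private final AtomicBoolean killRequested = new AtomicBoolean(false);

    // Extracts pids from the raw string. Returns an empty list for null or
    // blank input instead of passing "" to Integer.parseInt.
    static List<Integer> parsePids(String rawPidString) {
        List<Integer> pids = new ArrayList<>();
        if (rawPidString == null || rawPidString.trim().isEmpty()) {
            return pids;
        }
        Matcher matcher = PID_IN_PARENS.matcher(rawPidString);
        while (matcher.find()) {
            pids.add(Integer.parseInt(matcher.group(1)));
        }
        return pids;
    }

    // Returns true only for the first caller; later callers (for example a
    // timeout path racing with a manual kill) get false and can skip the kill.
    boolean tryStartKill() {
        return killRequested.compareAndSet(false, true);
    }
}
```

With a guard like this, a caller would check `tryStartKill()` before killing and treat an empty result from `parsePids` as "nothing left to kill" rather than as an error. Whether a guard belongs in `cancelApplication` or at a higher level is a design question for the actual fix.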
