crazychengmm opened a new issue, #17829:
URL: https://github.com/apache/dolphinscheduler/issues/17829

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   Description
   Version: DolphinScheduler 3.2.0
   
   Scenario:
   I started a workflow via the API. The tasks within the workflow completed 
successfully, but the workflow instance failed to recognize the completion. 
This led to a series of abnormal behaviors.
   
   Abnormal Behaviors:
   
   - Infinite Task Creation: The workflow kept creating and submitting new task 
     instances even though the previous ones had succeeded.
   - Retry Count Stagnation: The retry_times of the task remained at 1, even 
     though the workflow's max retry was configured to 2. It seemed like the 
     workflow was not "retrying" the failed task, but rather "re-triggering" 
     new tasks from scratch.
   - Data Inconsistency: In the Web UI, the workflow instance showed an 
     end_time in the past, but its state remained RUNNING.
   - Bizarre Behavior after Pausing: When I manually clicked "Pause" on the 
     workflow, the system continued to generate new task instances in a PAUSED 
     state every few seconds.
   
   Logs: I checked both the Master and Worker logs, but found no explicit 
   exceptions or error stacks during this period.
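
The data inconsistency above (a past end_time together with a RUNNING state) is the easiest symptom to check for mechanically. The following is a minimal sketch of such a check. The record shape and the field names `id`, `state`, and `end_time` are assumptions made for illustration and are not taken from the actual DolphinScheduler schema.

```python
from datetime import datetime
from typing import Optional, TypedDict


class WorkflowRow(TypedDict):
    # Hypothetical record shape; field names are assumptions, not the real schema.
    id: int
    state: str
    end_time: Optional[datetime]


def find_inconsistent(rows: list[WorkflowRow], now: datetime) -> list[int]:
    """Return ids of instances that are still RUNNING yet carry a past end_time."""
    return [
        r["id"]
        for r in rows
        if r["state"] == "RUNNING"
        and r["end_time"] is not None
        and r["end_time"] < now
    ]
```

Rows exported from the metadata database could be fed into `find_inconsistent` to list the affected workflow instances before deciding whether a Master restart is needed.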
   Recovery:
   The issue was resolved immediately after restarting the Master cluster. The 
"ghost" tasks stopped being created, and the workflow state synchronized.
   
   Possible Root Cause Suspected:
   It seems like the WorkflowExecuteRunnable or the state machine in the Master 
node entered an inconsistent state/loop where it failed to update the workflow 
status while incorrectly believing it needed to schedule more tasks, 
potentially due to event loss or a race condition in the internal event queue.
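
To make the suspicion concrete, here is a toy model of how a lost completion event could produce endless re-submission. It is purely illustrative and is not DolphinScheduler code: the function `run_toy_master`, the event name `TASK_SUCCESS`, and the tick loop are all hypothetical, and the real Master may behave quite differently.

```python
from collections import deque


def run_toy_master(drop_success_event: bool, max_ticks: int = 5) -> int:
    """Return how many task instances a toy scheduler submits.

    Toy model only: the scheduler submits a task on every tick for which it
    has no record that the task finished. If the success event is lost, that
    record never appears, so a new instance is submitted on every tick.
    """
    events: deque[str] = deque()
    finished = False
    submitted = 0
    for _ in range(max_ticks):
        # Drain the event queue and update the in-memory state.
        while events:
            if events.popleft() == "TASK_SUCCESS":
                finished = True
        if finished:
            break
        # No completion recorded, so the scheduler submits (another) instance.
        submitted += 1
        # The task itself succeeds, but its event may never reach the queue.
        if not drop_success_event:
            events.append("TASK_SUCCESS")
    return submitted
```

With `drop_success_event=False` the loop stops after a single submission. With `drop_success_event=True` it submits one instance per tick until `max_ticks` is reached, which resembles the repeated task creation described in this report.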
   
   Steps to Reproduce:
   
   1. Start a workflow instance via the API in version 3.2.0.
   2. Observe whether the task finishes but the workflow fails to transition 
      to SUCCESS.
   3. Check whether new task instances are generated repeatedly.
   4. Try to pause the workflow and observe whether paused tasks are still 
      being created.
   
   Expected Behavior:
   The workflow should transition to SUCCESS once all tasks are finished, and 
   no further task instances should be created.
   
   Actual Behavior:
   The workflow remains RUNNING (despite having an end_time), keeps creating 
   new tasks indefinitely, and even creates paused tasks after the workflow is 
   paused.
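
The "generated repeatedly" check can be approximated by counting task instances per task within one workflow instance. Below is a minimal sketch, assuming each attempt (the first run plus every retry) produces one task instance, so a task should never have more than `1 + max_retries` instances. The function name and the input shape (a flat list of task codes, one entry per task instance) are hypothetical.

```python
from collections import Counter


def tasks_over_retry_budget(task_codes: list[int], max_retries: int) -> dict[int, int]:
    """Map task code -> instance count for tasks that have more instances
    than one initial attempt plus max_retries would explain."""
    budget = 1 + max_retries  # assumption: one instance per attempt
    counts = Counter(task_codes)
    return {code: n for code, n in counts.items() if n > budget}
```

With max retry set to 2 as in this report, any task code that appears more than three times for the same workflow instance would be flagged.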
   
   Environment
   OS: CentOS 7
   DolphinScheduler Version: 3.2.0
   Storage: PostgreSQL
   Deployment: Cluster (3 masters, 6 workers)
   
   ### What you expected to happen
   
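   The workflow should transition to SUCCESS once all tasks are finished, and 
   no further task instances should be created.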
   
   ### How to reproduce
   
   I don't know how it happens.
   
   ### Anything else
   
   <img width="3826" height="1506" alt="Image" 
src="https://github.com/user-attachments/assets/262b25b0-ed8b-4861-935f-1c9a8f2b145a"
 />
   
   taskId 2330628
   
   ### Version
   
   3.2.x
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
