GitHub user BilgeKaanGencdogan edited a discussion: Task dispatch fails with
connection refused error to worker host ip:1234
Before I describe the situation, here are the technical details of the
system:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* DolphinScheduler version: 3.2.0. It runs standalone (not a cluster, not on
Docker or k8s), AND THIS IS PROD.
```
[user@user ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          192Gi        64Gi       105Gi       2.4Gi        23Gi       124Gi
Swap:         8.0Gi          0B       8.0Gi
```
```
[user@user ~]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
devtmpfs                    97G     0   97G   0% /dev
tmpfs                       97G  1.1M   97G   1% /dev/shm
tmpfs                       97G  2.3G   94G   3% /run
tmpfs                       97G     0   97G   0% /sys/fs/cgroup
/dev/mapper/rhel-root      202G   16G  187G   8% /
/dev/mapper/rhel-usr        10G  5.3G  4.8G  53% /usr
/dev/mapper/vgdata-lvdata  400G   20G  381G   5% /data
/dev/sda2                  2.0G  439M  1.6G  22% /boot
/dev/sda1                  2.0G  5.9M  2.0G   1% /boot/efi
tmpfs                       20G     0   20G   0% /run/user/1007
tmpfs                       20G  8.0K   20G   1% /run/user/1006
```
```
[user@user ~]$ java --version
openjdk 11.0.24 2024-07-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.24.0.8-2) (build 11.0.24+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.24.0.8-2) (build 11.0.24+8-LTS, mixed mode, sharing)
```
* jvm_args_env.sh for both the master and worker servers:
```
[root@user bin]# cat /data/dolphin/master-server/bin/jvm_args_env.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
-Xms32g
-Xmx32g
-Xmn16g
-XX:+IgnoreUnrecognizedVMOptions
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:gc.log
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dump.hprof
-Duser.timezone=${SPRING_JACKSON_TIME_ZONE}
[root@user bin]# cat /data/dolphin/worker-server/bin/jvm_args_env.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
-Xms32g
-Xmx32g
-Xmn16g
-XX:+IgnoreUnrecognizedVMOptions
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:gc.log
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dump.hprof
-Duser.timezone=${SPRING_JACKSON_TIME_ZONE}
```
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* Below are two logs, one from the master-server and one from the worker-server.
`The first is the master-server's log:`
```
[WARN] 2025-10-02 09:01:26.576 +0300 org.apache.dolphinscheduler.remote.NettyRemotingClient:[321] - [WorkflowInstance-0][TaskInstance-0] - connect to Host(ip=IP, port=PORT) error
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /IP:PORT
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:829)
[ERROR] 2025-10-02 09:01:26.576 +0300 org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper:[87] - [WorkflowInstance-0][TaskInstance-0] - Dispatch task failed
org.apache.dolphinscheduler.server.master.exception.TaskDispatchException: Dispatch task to IP:PORT failed
    at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:101)
    at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.dispatchTask(BaseTaskDispatcher.java:74)
    at org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper.run(GlobalTaskDispatchWaitingQueueLooper.java:79)
Caused by: org.apache.dolphinscheduler.remote.exceptions.RemotingException: connect to : Host(ip=IP, port=PORT) fail
    at org.apache.dolphinscheduler.remote.NettyRemotingClient.sendSync(NettyRemotingClient.java:210)
    at org.apache.dolphinscheduler.server.master.rpc.MasterRpcClient.sendSyncCommand(MasterRpcClient.java:49)
    at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:87)
    ... 2 common frames omitted
```
`The second is the worker-server's log:`
```
[INFO] 2025-10-02 09:01:25.764 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[289] - [WorkflowInstance-59856][TaskInstance-507499] - The current execute mode isn't develop mode, will clear the task execute file: /data/dolphin/exec/process/default/15236034355840/15257910743307_11/59856/507499
[INFO] 2025-10-02 09:01:25.765 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[304] - [WorkflowInstance-59856][TaskInstance-507499] - Success clear the task execute file: /data/dolphin/exec/process/default/15236034355840/15257910743307_11/59856/507499
[INFO] 2025-10-02 09:01:25.765 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[330] - [WorkflowInstance-59856][TaskInstance-507499] - FINALIZE_SESSION
[INFO] 2025-10-02 09:01:52.136 +0300 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181.
[INFO] 2025-10-02 09:01:52.136 +0300 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[INFO] 2025-10-02 09:01:59.281 +0300 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:40934, server: localhost/0:0:0:0:0:0:0:1:2181
[INFO] 2025-10-02 09:01:59.282 +0300 org.apache.zookeeper.ClientCnxn:[1444] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, session id = 0x10000001be900ac, negotiated timeout = 30000
[INFO] 2025-10-02 09:01:59.282 +0300 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
[INFO] 2025-10-02 09:01:59.580 +0300 org.apache.dolphinscheduler.server.worker.processor.WorkerTaskUpdatePidAckProcessor:[59] - [WorkflowInstance-0][TaskInstance-507499] - task execute update pid ack command : TaskUpdateRuntimeAckMessage(success=true, taskInstanceId=507499)
[INFO] 2025-10-02 09:01:59.580 +0300 org.apache.dolphinscheduler.server.worker.processor.WorkerTaskExecuteResultAckProcessor:[58] - [WorkflowInstance-0][TaskInstance-507499] - Receive task execute response ack command : TaskExecuteResultMessageAck(super=BaseMessage(messageSenderAddress=IP:5678, messageReceiverAddress=IP:1234, messageSendTime=1759384886490), taskInstanceId=507499, success=true)
```
### **WHAT HAPPENED?**
The DolphinScheduler worker service on the machine experienced a critical
failure on October 2, 2025 at 09:40 AM, causing port 1234 to stop listening and
resulting in "Connection refused" errors from the master server. CPU usage then
reached 100%, and DolphinScheduler jobs did not finish properly and were left
hanging, which amounted to a traffic jam. However, all the services stayed up
the whole time.
### **REASONABLE FINDINGS FROM US**
The cause was a catastrophic thread leak, not a memory shortage. The worker
accumulated 21,466+ threads (growing at ~100 threads/minute) over 77.7 days of
operation, consuming approximately 21 GB of RAM for thread stacks alone. This
caused garbage collection pauses to degrade from 80ms to over 1,100ms, making
the system unresponsive. Eventually, the Netty event executor terminated, port
1234 stopped listening, and the worker became non-functional. The system was
manually restarted at 11:38 AM and has been running since, but the thread leak
is still active and growing, making another failure inevitable within days or
weeks.
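As a rough sanity check on the figures above, here is a minimal Python sketch of the arithmetic. It assumes each Java thread reserves the HotSpot default stack size of 1 MiB on 64-bit Linux (no `-Xss` is set in the `jvm_args_env.sh` shown earlier); the function names are illustrative only. Note that stack memory is reserved as virtual address space and only committed as it is touched, so the result is an upper bound rather than measured resident memory.

```python
# Back-of-the-envelope check of the thread-leak figures quoted in the post.
# Assumption: 1 MiB stack per thread (HotSpot default on 64-bit Linux);
# the real value depends on -Xss / ThreadStackSize.

THREADS = 21_466            # thread count reported in the post
STACK_BYTES = 1024 * 1024   # assumed 1 MiB per thread stack
UPTIME_DAYS = 77.7          # worker uptime reported in the post


def stack_reservation_gib(threads: int, stack_bytes: int = STACK_BYTES) -> float:
    """Upper bound on memory reserved for thread stacks, in GiB."""
    return threads * stack_bytes / 1024**3


def average_growth_per_minute(threads: int, uptime_days: float) -> float:
    """Average net thread growth if the leak were spread evenly over the uptime."""
    return threads / (uptime_days * 24 * 60)


if __name__ == "__main__":
    print(f"stack reservation: {stack_reservation_gib(THREADS):.2f} GiB")
    print(f"average growth: {average_growth_per_minute(THREADS, UPTIME_DAYS):.3f} threads/min")
```

With the reported numbers this gives roughly 21 GiB of stack reservation, which matches the estimate above. The average net growth over the whole uptime comes out well below one thread per minute, so the ~100 threads/minute figure presumably describes a recent measurement window rather than the full 77.7 days.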
### **REASONABLE SOLUTIONS FROM US**
`* Reduce Young Generation Size (Prevent long GC pauses)`
```
# Edit worker JVM configuration
vim /data/dolphin/worker-server/bin/jvm_args_env.sh
# Change from:
-Xmn16g
# To:
-Xmn8g
# Restart worker
```
`* Reduce Concurrent Task Limit`
```
# Edit worker configuration
vim /data/dolphin/worker-server/conf/application.yaml
# Find and change:
worker:
  exec-threads: 100 # Change to 50
  # Add new lines:
  max-cpu-load-avg: 0.7
  reserved-memory: 0.3
# Restart worker
```
`* Investigate DolphinScheduler Thread Pool Bug`
This is the actual root cause that must be fixed. Candidate causes:
- A bug in DolphinScheduler's thread pool implementation
- Task completion handlers not cleaning up threads
- An executor service that is not properly bounded
- A thread factory creating threads without limit
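To narrow down which pool is leaking, one option is to group the threads in a thread dump by name pattern. The sketch below is a minimal, hypothetical helper (not part of DolphinScheduler): it assumes a dump saved with the JDK's `jstack <pid>` tool, where each thread entry begins with the quoted thread name, and it replaces every run of digits in a name with `N` so that numbered threads from the same factory fall into one bucket.

```python
import re
import sys
from collections import Counter

# In a jstack dump, each thread entry starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>.+?)"\s')
DIGITS = re.compile(r"\d+")


def group_threads(dump_text: str) -> Counter:
    """Count threads per name pattern, with digit runs normalized to 'N'."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            counts[DIGITS.sub("N", match.group("name"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python group_threads.py dump.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for pattern, count in group_threads(handle.read()).most_common(20):
            print(f"{count:8d}  {pattern}")
```

Taking two dumps some minutes apart and comparing the top entries should show which name pattern keeps growing, which in turn points at the code that creates those threads.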
### **WHAT TO EXPECT**
Please read all the information above carefully and assess the solutions we
have chosen. This system is critical to us, so we want to proceed very
cautiously before implementing them. Can these solutions solve the problem?
Also, from what I found on the web, JVM heap memory management can play a
critical role here, so I would also appreciate guidance on JVM heap management
for this performance issue.
Thanks in advance
GitHub link: https://github.com/apache/dolphinscheduler/discussions/17571