niyanchun commented on issue #16738:
URL: 
https://github.com/apache/dolphinscheduler/issues/16738#issuecomment-2440653900

   I may find the reason:  in master logs:
   
   ```
    1632 [WI-0][TI-0] - [WARN] 2024-10-28 11:00:02.949 +0800 
o.a.d.s.m.d.h.LowerWeightHostManager:[139] - Worker 10.10.3.9:1234 in 
workerGroup default is Busy, heartbeat is WorkerHeartBeat(workerHostWeight=100, 
threadPoolUsa      ge=11)
    1633 [WI-0][TI-0] - [WARN] 2024-10-28 11:00:02.949 +0800 
o.a.d.s.m.d.h.LowerWeightHostManager:[139] - Worker 10.10.3.15:1234 in 
workerGroup default is Busy, heartbeat is WorkerHeartBeat(workerHostWeight=100, 
threadPoolUs      age=7)
    1634 [WI-0][TI-0] - [WARN] 2024-10-28 11:00:02.949 +0800 
o.a.d.s.m.d.h.LowerWeightHostManager:[139] - Worker 10.10.3.9:1234 in 
workerGroup 采集 is Busy, heartbeat is WorkerHeartBeat(workerHostWeight=100, 
threadPoolUsage=      11)
    1635 [WI-0][TI-0] - [WARN] 2024-10-28 11:00:02.949 +0800 
o.a.d.s.m.d.h.LowerWeightHostManager:[139] - Worker 10.10.3.15:1234 in 
workerGroup 采集 is Busy, heartbeat is WorkerHeartBeat(workerHostWeight=100, 
threadPoolUsage      =7)
    1636 [WI-0][TI-0] - [WARN] 2024-10-28 11:00:02.949 +0800 
o.a.d.s.m.d.h.LowerWeightHostManager:[139] - Worker 10.10.3.9:1234 in 
workerGroup 数仓 is Busy, heartbeat is WorkerHeartBeat(workerHostWeight=100, 
threadPoolUsage=      11)
    1637 [WI-0][TI-0] - [WARN] 2024-10-28 11:00:02.949 +0800 
o.a.d.s.m.d.h.LowerWeightHostManager:[139] - Worker 10.10.3.15:1234 in 
workerGroup 数仓 is Busy, heartbeat is WorkerHeartBeat(workerHostWeight=100, 
threadPoolUsage      =7)
   ```
   
   If worker status is busy,  it will cause  `workerHostWeightsMap` empty or 
without the workgroup:
   
   ```
           /**
            * Sync worker resource.
            *
            * @param workerGroupNodes  worker group nodes, key is worker group, 
value is worker group nodes.
            * @param workerNodeInfoMap worker node info map, key is worker 
node, value is worker info.
            */
           private void syncWorkerResources(final Map<String, Set<String>> 
workerGroupNodes,
                                            final Map<String, WorkerHeartBeat> 
workerNodeInfoMap) {
               try {
                   Map<String, Set<HostWeight>> workerHostWeights = new 
HashMap<>();
                   for (Map.Entry<String, Set<String>> entry : 
workerGroupNodes.entrySet()) {
                       String workerGroup = entry.getKey();
                       Set<String> nodes = entry.getValue();
                       Set<HostWeight> hostWeights = new 
HashSet<>(nodes.size());
                       for (String node : nodes) {
                           WorkerHeartBeat heartbeat = 
workerNodeInfoMap.getOrDefault(node, null);
                           Optional<HostWeight> hostWeightOpt = 
getHostWeight(node, workerGroup, heartbeat);
                           hostWeightOpt.ifPresent(hostWeights::add);
                       }
                       if (!hostWeights.isEmpty()) {
                           workerHostWeights.put(workerGroup, hostWeights);
                       }
                   }
                   syncWorkerHostWeight(workerHostWeights);
               } catch (Throwable ex) {
                   log.error("Sync worker resource error", ex);
               }
           }
   
           private void syncWorkerHostWeight(Map<String, Set<HostWeight>> 
workerHostWeights) {
               workerGroupWriteLock.lock();
               try {
                   workerHostWeightsMap.clear();
                   workerHostWeightsMap.putAll(workerHostWeights);
               } finally {
                   workerGroupWriteLock.unlock();
               }
           }
       }
   
       public Optional<HostWeight> getHostWeight(String workerAddress, String 
workerGroup, WorkerHeartBeat heartBeat) {
           if (heartBeat == null) {
               log.warn("Worker {} in WorkerGroup {} have not received the 
heartbeat", workerAddress, workerGroup);
               return Optional.empty();
           }
           if (ServerStatus.BUSY == heartBeat.getServerStatus()) {
               log.warn("Worker {} in workerGroup {} is Busy, heartbeat is {}", 
workerAddress, workerGroup, heartBeat);
               return Optional.empty();
           }
           return Optional.of(
                   new HostWeight(
                           HostWorker.of(workerAddress, 
heartBeat.getWorkerHostWeight(), workerGroup),
                           heartBeat.getCpuUsage(),
                           heartBeat.getMemoryUsage(),
                           heartBeat.getDiskUsage(),
                           heartBeat.getThreadPoolUsage(),
                           heartBeat.getStartupTime()));
       }
   
   ```
   
   I think this can be optimized, when the server is busy,  it does not means 
work group not exist, it only indicate the server is busy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to