Re: [PR] HDFS-17909. [ARR] AsyncRouterHandlerExecutors should use bounded queue [hadoop]

via GitHub Thu, 07 May 2026 00:18:35 -0700


hfutatzhanghb commented on PR #8448:
URL: https://github.com/apache/hadoop/pull/8448#issuecomment-4394999351


   > > > Hi @kokonguyen191 @hfutatzhanghb @ZanderXu I had the impression that 
DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY could control the maximum number 
of requests. Is it possible to achieve the same effect using this?
   > > 
   > > 
   > > I have the same question as @KeeProMise. @kokonguyen191 Could you 
please clarify here? IIUC, by the time we call acquirePermit, the Router has 
already put this RPC call into its call queue. So 
`DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY` cannot control the max number 
of calls on the router side?
   > 
   > Thanks @KeeProMise @hfutatzhanghb for your review.
   > 
   > Let me try to explain my understanding of this issue.
   > 
   > 1. On the RBF side, there are two relevant thread groups: the 8888 
handlers and the NS handlers inside the NS thread pool.
   > 2. The 8888 handlers receive client requests and submit them to the 
corresponding NS thread pool. These requests are put into the thread pool’s 
unbounded queue. The NS handlers then take requests from the queue and process 
them.
   > 3. This is basically a producer-consumer model. The 8888 handlers are the 
producers, and the NS handlers are the consumers. If the producers are faster 
than the consumers, the unbounded queue may keep growing and eventually exhaust 
memory. Once that happens, the whole RBF instance may become unavailable.
   > 
   > Unfortunately, there are several cases where the consumers can become 
slower than the producers:
   > 
   > 1. When NS permits are exhausted, NS handlers may have to wait for a 
permit, up to DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT.
   > 2. Even after an NS handler gets a permit, it may still get blocked while 
sending the request to the downstream NN. For example, the NN may be slow or 
unavailable because of GC, a large delete, a large rename, HA failover, or a 
machine failure.
   > 
   > In these cases, NS handlers cannot consume requests fast enough, but the 
8888 handlers may continue accepting and enqueueing new requests. This can 
cause the unbounded queue to grow continuously and eventually trigger OOM.
   
   @ZanderXu Thanks very much for your explanation. Got it. I will review this 
later. Hi @KeeProMise, cc'ing you as well. What's your opinion?
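
For reference, the producer-consumer concern quoted above can be reproduced with a plain `java.util.concurrent` sketch. This is not the PR's implementation: the class name `BoundedQueueSketch`, the method `submitWhileConsumersBlocked`, the pool sizes, and the choice of `AbortPolicy` are all assumptions made for illustration. The latch stands in for NS handlers that are stuck waiting on a permit or on a slow NN, so nothing drains the queue while the producer keeps submitting.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and sizes are not taken from the PR.
public class BoundedQueueSketch {

  /**
   * Submits `calls` tasks to a fixed-size pool whose workers are all blocked,
   * and returns how many submissions the bounded queue rejected.
   */
  public static int submitWhileConsumersBlocked(int handlers, int queueCapacity, int calls) {
    // Stands in for "permit available" or "downstream NN responsive again".
    CountDownLatch downstreamRecovered = new CountDownLatch(1);

    // Bounded queue: once it is full and all workers are busy, execute() is rejected
    // instead of letting the backlog grow without limit.
    ThreadPoolExecutor nsPool = new ThreadPoolExecutor(
        handlers, handlers, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        new ThreadPoolExecutor.AbortPolicy());

    int rejected = 0;
    for (int i = 0; i < calls; i++) {
      try {
        // Producer side: analogous to an RPC handler handing a call to the NS pool.
        nsPool.execute(() -> {
          try {
            downstreamRecovered.await(); // consumer is stuck until the downstream recovers
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      } catch (RejectedExecutionException e) {
        rejected++; // backpressure: the caller learns immediately that the pool is saturated
      }
    }

    // Let the blocked workers finish so the pool can shut down cleanly.
    downstreamRecovered.countDown();
    nsPool.shutdown();
    try {
      nsPool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return rejected;
  }

  public static void main(String[] args) {
    // 2 workers hold one task each, 4 tasks wait in the queue, the other 14 are rejected.
    System.out.println("rejected=" + submitWhileConsumersBlocked(2, 4, 20));
  }
}
```

With an unbounded `LinkedBlockingQueue` in place of the `ArrayBlockingQueue`, the same loop would reject nothing and every extra call would sit in memory until the consumers recover, which is the growth pattern described in the quoted explanation. How rejected calls should be surfaced to clients is a separate design choice that this sketch does not address.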


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]