Yukang-Lian opened a new pull request, #61951:
URL: https://github.com/apache/doris/pull/61951

   ## Summary
   
   When a follower FE's pooled Thrift connection to the master FE goes stale 
(common in cross-AZ Kubernetes deployments because of NAT/LB idle timeouts), 
`GenericPool.reopen()` blocks for the full RPC timeout (default 1080s = 18 
minutes) before failing. Until it fails, DDL forwarding, transaction 
operations, and token management, all of which require master communication, 
are stalled.
   
   **Root cause:** `borrowObject()` inflates both `connectTimeout_` and 
`socketTimeout_` via `TSocket.setTimeout()`. These field values survive the 
`close()`/`open()` cycle in `reopen()`, and the timeout restoration only 
happens *after* `open()` returns — too late to protect the connect and TLS 
handshake phases.
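
   The ordering problem can be illustrated with a small model. This is only a 
sketch: `ModelSocket` and `StaleReopenDemo` are hypothetical names, not the 
real Thrift `TSocket` or Doris `GenericPool`, and the model assumes, as 
described above, that a single `setTimeout()` call sets both timeout fields 
and that `close()` leaves them untouched.

```java
// Simplified model, NOT the real Thrift TSocket: it only records which
// connect timeout would be in effect at the moment open() is called.
class ModelSocket {
    private int connectTimeoutMs;
    private int socketTimeoutMs;
    private boolean open;
    int connectTimeoutUsedByLastOpen = -1;

    // Assumption taken from the description above: one call sets both fields.
    void setTimeout(int ms) {
        connectTimeoutMs = ms;
        socketTimeoutMs = ms;
    }

    void open() {
        // A real socket would block up to connectTimeoutMs here.
        connectTimeoutUsedByLastOpen = connectTimeoutMs;
        open = true;
    }

    void close() {
        open = false; // timeout fields deliberately survive close()
    }
}

class StaleReopenDemo {
    static final int RPC_TIMEOUT_MS = 1_080_000; // 1080 s, the default cited above

    // Pre-fix ordering: close, open, and only then restore the timeout.
    static int reopenWithoutProtection(ModelSocket s, int restoreMs) {
        s.close();
        s.open();                // still governed by the inflated value
        s.setTimeout(restoreMs); // too late to bound the connect phase
        return s.connectTimeoutUsedByLastOpen;
    }

    public static void main(String[] args) {
        ModelSocket s = new ModelSocket();
        s.setTimeout(RPC_TIMEOUT_MS); // what borrowObject() is described as doing
        int used = reopenWithoutProtection(s, RPC_TIMEOUT_MS);
        System.out.println("connect timeout in effect during reopen (ms): " + used);
    }
}
```

   In this model the value seen by `open()` is whatever was set last before 
`close()`, which is why restoring the timeout afterwards cannot shorten the 
connect phase.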
   
   **Fix (two layers):**
   - **Layer 1 — reopen() timeout protection:** Set a short connect timeout 
(`thrift_rpc_connect_timeout_ms`, default 10s) *before* `close()` to protect 
the subsequent `open()` phase, then restore the actual RPC timeout after 
successful `open()`. This must happen before `close()` because Thrift 0.16.0's 
`TSocket.close()` nulls the internal socket, so calling `setSocketTimeout()` 
after that would throw a `NullPointerException`.
   - **Layer 2 — batch cleanup on failure:** Add `reopenOrClear()` that clears 
all idle connections for the failed address when `reopen()` fails, preventing 
sequential drain of up to 128 stale connections.
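
   The two layers might look roughly like the following. This is a sketch under 
stated assumptions, not the code in this PR: `FakeConn` and `ReopenSketch` are 
hypothetical stand-ins for a pooled Thrift client and `GenericPool`, the idle 
connections for one address are modelled as a plain `Deque`, and the method 
signatures are illustrative only.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a pooled Thrift client connection.
class FakeConn {
    int timeoutMs;
    boolean open = true;
    final boolean reachable;
    int timeoutSeenByOpen = -1;

    FakeConn(boolean reachable) { this.reachable = reachable; }

    void setTimeout(int ms) { timeoutMs = ms; }

    void close() { open = false; }

    void open() throws IOException {
        timeoutSeenByOpen = timeoutMs; // the timeout that bounds the connect phase
        if (!reachable) {
            throw new IOException("connect timed out after " + timeoutMs + " ms");
        }
        open = true;
    }
}

class ReopenSketch {
    // Default of thrift_rpc_connect_timeout_ms as stated in this PR (10 s).
    static final int CONNECT_TIMEOUT_MS = 10_000;

    // Layer 1: shorten the timeout BEFORE close(), restore it after open().
    static boolean reopen(FakeConn c, int rpcTimeoutMs) {
        c.setTimeout(CONNECT_TIMEOUT_MS);
        c.close();
        try {
            c.open();
        } catch (IOException e) {
            return false;
        }
        c.setTimeout(rpcTimeoutMs);
        return true;
    }

    // Layer 2: on failure, drop every idle connection for the same address.
    static boolean reopenOrClear(FakeConn c, int rpcTimeoutMs, Deque<FakeConn> idleForAddress) {
        boolean ok = reopen(c, rpcTimeoutMs);
        if (!ok) {
            for (FakeConn other : idleForAddress) {
                other.close();
            }
            idleForAddress.clear();
        }
        return ok;
    }

    public static void main(String[] args) {
        Deque<FakeConn> idle = new ArrayDeque<>();
        for (int i = 0; i < 128; i++) {
            idle.add(new FakeConn(false)); // 128 stale connections
        }
        boolean ok = reopenOrClear(new FakeConn(false), 1_080_000, idle);
        System.out.println("reopened=" + ok + ", idle left=" + idle.size());
    }
}
```

   Clearing the idle set in one step matters even with Layer 1 in place: if 
each of 128 stale connections had to fail individually at up to 10 s apiece, 
a sequential drain could still take on the order of 1280 s.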
   
   **Changes:**
   - `Config.java` — new config `thrift_rpc_connect_timeout_ms` (default 10s)
   - `GenericPool.java` — fix both `reopen()` overloads; add `reopenOrClear()` 
methods
   - `FEOpExecutor.java`, `MasterTxnExecutor.java`, `TokenManager.java`, 
`BrokerUtil.java` — use `reopenOrClear()`
   
   ## Test plan
   
   - [x] Unit tests: `GenericPoolTest` — 5 new test cases covering reopen 
timeout, no-arg reopen default timeout restoration, `reopenOrClear()` 
success/failure paths, and zero-timeout backward compatibility
   - [x] FE build: `build.sh --fe` passes
   - [x] Unit tests: `run-fe-ut.sh --run GenericPoolTest` passes
   - [ ] Docker integration test: `test_forward_reopen_timeout.groovy` — 2-FE 
cluster, restart master to create stale connections, verify DDL forwarding 
recovers within 60s instead of 1080s


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

