Yukang-Lian opened a new pull request, #61951: URL: https://github.com/apache/doris/pull/61951
## Summary

When a follower FE's pooled Thrift connection to the master FE goes stale (common in cross-AZ K8S deployments due to NAT/LB idle timeouts), `GenericPool.reopen()` blocks for the full RPC timeout (default 1080s = 18 minutes) before failing. This blocks all DDL forwarding, transaction operations, and token management that require master communication.

**Root cause:** `borrowObject()` inflates both `connectTimeout_` and `socketTimeout_` via `TSocket.setTimeout()`. These field values survive the `close()`/`open()` cycle in `reopen()`, and the timeout restoration only happens *after* `open()` returns, which is too late to protect the connect and TLS handshake phases.

**Fix (two layers):**

- **Layer 1: `reopen()` timeout protection.** Set a short connect timeout (`thrift_rpc_connect_timeout_ms`, default 10s) *before* `close()` to protect the subsequent `open()` phase, then restore the actual RPC timeout after a successful `open()`. The timeout must be set before `close()` because Thrift 0.16.0's `TSocket.close()` nulls the internal socket, and `setSocketTimeout()` would throw an NPE after that.
- **Layer 2: batch cleanup on failure.** Add `reopenOrClear()`, which clears all idle connections for the failed address when `reopen()` fails, preventing a sequential drain of up to 128 stale connections.

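The ordering in the two layers can be illustrated with a minimal, dependency-free sketch. It is not the code in `GenericPool.java`: `FakeTransport`, the `idleByAddress` map, the `master-fe:9020` address, and both method signatures are hypothetical stand-ins, chosen only to show that the short timeout is applied before `close()`, that `open()` runs under it, and that a failed reopen empties the idle set for that address.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ReopenSketch {
    /** Hypothetical stand-in for a pooled client's socket; this is not the real TSocket. */
    static final class FakeTransport {
        private final boolean reachable;   // whether the simulated peer accepts connections
        private boolean socketPresent = true;
        int timeoutMs;                     // timeout currently configured
        int timeoutDuringLastOpen = -1;    // timeout that was in effect when open() last ran

        FakeTransport(int timeoutMs, boolean reachable) {
            this.timeoutMs = timeoutMs;
            this.reachable = reachable;
        }

        void setTimeout(int ms) {
            // Models the constraint from the description: no timeout changes after close().
            if (!socketPresent) {
                throw new IllegalStateException("socket was released by close()");
            }
            timeoutMs = ms;
        }

        void close() {
            socketPresent = false;
        }

        void open() throws IOException {
            timeoutDuringLastOpen = timeoutMs; // the configured value survives close()
            if (!reachable) {
                throw new IOException("connect timed out after " + timeoutMs + " ms");
            }
            socketPresent = true;
        }
    }

    /** Layer 1: shorten the timeout before close(), restore the RPC timeout after open() succeeds. */
    static boolean reopen(FakeTransport t, int connectTimeoutMs, int rpcTimeoutMs) {
        t.setTimeout(connectTimeoutMs);
        t.close();
        try {
            t.open();
        } catch (IOException e) {
            return false;
        }
        t.setTimeout(rpcTimeoutMs);
        return true;
    }

    /** Layer 2: if the reopen fails, drop every idle connection pooled for the same address. */
    static boolean reopenOrClear(Map<String, Deque<FakeTransport>> idleByAddress, String address,
                                 FakeTransport t, int connectTimeoutMs, int rpcTimeoutMs) {
        boolean ok = reopen(t, connectTimeoutMs, rpcTimeoutMs);
        if (!ok) {
            Deque<FakeTransport> idle = idleByAddress.get(address);
            if (idle != null) {
                idle.clear();
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        Map<String, Deque<FakeTransport>> idleByAddress = new HashMap<>();
        Deque<FakeTransport> idle = new ArrayDeque<>();
        for (int i = 0; i < 3; i++) {
            idle.add(new FakeTransport(1_080_000, false));
        }
        idleByAddress.put("master-fe:9020", idle); // hypothetical address

        FakeTransport borrowed = new FakeTransport(1_080_000, false);
        boolean ok = reopenOrClear(idleByAddress, "master-fe:9020", borrowed, 10_000, 1_080_000);
        System.out.println("reopened=" + ok + ", open() ran with "
                + borrowed.timeoutDuringLastOpen + " ms, idle left=" + idle.size());
    }
}
```

In this model a stale connection fails under the 10s connect timeout rather than the 1080s RPC timeout, and the remaining idle entries for the same address are discarded in one step instead of being tried one by one.
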
**Changes:**

- `Config.java`: new config `thrift_rpc_connect_timeout_ms` (default 10s)
- `GenericPool.java`: fix both `reopen()` overloads; add `reopenOrClear()` methods
- `FEOpExecutor.java`, `MasterTxnExecutor.java`, `TokenManager.java`, `BrokerUtil.java`: use `reopenOrClear()`

## Test plan

- [x] Unit tests: `GenericPoolTest` adds 5 new test cases covering reopen timeout, no-arg reopen default timeout restoration, `reopenOrClear` success/failure paths, and zero-timeout backward compatibility
- [x] FE build: `build.sh --fe` passes
- [x] Unit tests: `run-fe-ut.sh --run GenericPoolTest` passes
- [ ] Docker integration test: `test_forward_reopen_timeout.groovy` runs a 2-FE cluster, restarts the master to create stale connections, and verifies that DDL forwarding recovers within 60s instead of 1080s
