[
https://issues.apache.org/jira/browse/HBASE-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074385#comment-18074385
]
Nick Dimiduk commented on HBASE-28741:
--------------------------------------
ConnectionRegistryRpcStubHolder.fetchClusterIdAndCreateStubs() can permanently
orphan its CompletableFuture. The listener callback's success path (creating
the RPC client and stubs) is not wrapped in a try-catch. If
RpcClientFactory.createClient() or createStubs() throws, the exception
propagates into FutureUtils.addListener's catch-all, which logs "Unexpected
error caught when processing CompletableFuture" but cannot complete the future.
Neither complete() nor completeExceptionally() is ever called, so every caller
up the chain hangs indefinitely. A second path exists: if ClusterIdFetcher
construction throws, the future is assigned to addr2StubFuture but the method
throws before returning, leaving addr2StubFuture pointing at a zombie future
that subsequent getStubs() calls will return.
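To illustrate the first path, here is a minimal sketch of a callback that always drives the dependent future to a terminal state. It uses only JDK classes, not the actual HBase code; the class name SafeListenerSketch and the helper chain() are hypothetical, and the lambda that throws merely stands in for a failing RpcClientFactory.createClient() or createStubs().

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class SafeListenerSketch {

  /**
   * Hypothetical helper: derives a new future from {@code source}. Anything
   * thrown by {@code onSuccess} is routed into the returned future instead of
   * escaping the callback, so the returned future cannot be left incomplete.
   */
  static <T, R> CompletableFuture<R> chain(CompletableFuture<T> source,
      Function<T, R> onSuccess) {
    CompletableFuture<R> result = new CompletableFuture<>();
    source.whenComplete((value, error) -> {
      if (error != null) {
        // Failure path: propagate the upstream error as-is.
        result.completeExceptionally(error);
        return;
      }
      try {
        // Success path: the step that may throw (stands in for client/stub creation).
        result.complete(onSuccess.apply(value));
      } catch (Throwable t) {
        // Without this catch, 'result' would never be completed.
        result.completeExceptionally(t);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    CompletableFuture<String> clusterId = CompletableFuture.completedFuture("cluster-1");
    CompletableFuture<String> stubs = chain(clusterId, id -> {
      throw new UnsupportedOperationException("simulated client construction failure");
    });
    // The caller sees a failed future rather than one that hangs forever.
    System.out.println("completed exceptionally: " + stubs.isCompletedExceptionally());
  }
}
```

The second path would need a separate guard: either construct the fetcher before publishing the future, or clear the published reference when construction fails.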
Separately, ClusterIdFetcher creates its RPC channel with rpcTimeout=0. The
code comment says the timeout doesn't matter because it's "only a preamble
connection header," but the preamble exchange still requires a TCP connect and
potentially a TLS negotiation, either of which can hang.
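One way to bound such a wait, independent of the channel's own timeout, is to put a deadline on the future itself. The sketch below uses only JDK classes and is not the HBase implementation; DeadlineSketch, withDeadline() and the 50 ms value are hypothetical, and a real client would presumably reuse an existing timer rather than create a scheduler.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineSketch {

  /**
   * Hypothetical helper: fails {@code future} with a TimeoutException if it
   * has not completed within the given time. Returns the same future.
   */
  static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> future,
      long timeout, TimeUnit unit, ScheduledExecutorService timer) {
    ScheduledFuture<?> task = timer.schedule(() -> {
      // No effect if the future has already completed.
      future.completeExceptionally(
        new TimeoutException("no result within " + timeout + " " + unit));
    }, timeout, unit);
    // Cancel the pending timer task once the future finishes for any reason.
    future.whenComplete((value, error) -> task.cancel(false));
    return future;
  }

  public static void main(String[] args) {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    try {
      // Simulates a preamble exchange that never finishes.
      CompletableFuture<String> hung = new CompletableFuture<>();
      withDeadline(hung, 50, TimeUnit.MILLISECONDS, timer);
      try {
        hung.join();
      } catch (CompletionException e) {
        System.out.println("failed with: " + e.getCause());
      }
    } finally {
      timer.shutdownNow();
    }
  }
}
```

On Java 9 and later, CompletableFuture.orTimeout() provides comparable behavior without a hand-rolled timer.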
Both issues were observed together in production: a certificate authority
sidecar returning HTTP 500 caused the RPC client constructor to throw
UnsupportedOperationException, orphaning the future and hanging the connection
for 300 seconds until the framework killed the thread.
(written with AI)
> Rpc ConnectionRegistry APIs should have timeout
> -----------------------------------------------
>
> Key: HBASE-28741
> URL: https://issues.apache.org/jira/browse/HBASE-28741
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.6.0, 2.4.18, 2.5.10
> Reporter: Viraj Jasani
> Assignee: Nick Dimiduk
> Priority: Major
>
> The ConnectionRegistry APIs are some of the most basic metadata APIs; they
> determine how clients can interact with the servers after getting the
> required metadata. These APIs should time out quickly if they cannot serve
> metadata in time.
> Similar to HBASE-28428, which introduced a timeout for the ZooKeeper
> ConnectionRegistry APIs, we should also introduce a timeout (same timeout
> values) for the Rpc ConnectionRegistry APIs. RpcConnectionRegistry uses the
> HBase RPC framework with hedged read fanout mode.
> We have two options for introducing a timeout:
> # Use RetryTimer to keep watch on the CompletableFuture and complete it
> exceptionally if the timeout is reached (similar to the proposal in
> HBASE-28428).
> # Introduce a separate RPC timeout config for
> AbstractRpcBasedConnectionRegistry, since the RPC timeout for generic RPC
> operations (hbase.rpc.timeout) could be higher.