[
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18075865#comment-18075865
]
Junegunn Choi commented on HBASE-30101:
---------------------------------------
h2. Note on HBase 2.6+
I ran into these problems on HBase 2.4, and the patch is crafted to target that
case: it closes the race between {{bind()}} and {{start()}} during which netty
workers can handle incoming connections and touch {{UserGroupInformation}}
before the main thread has logged in.
On HBase 2.6+, the situation is a little different.
[HBASE-28321|https://issues.apache.org/jira/browse/HBASE-28321] (PR
[#5688|https://github.com/apache/hbase/pull/5688]) added this line to the
{{RpcServer}} constructor:
{code:java}
serverPrincipal = Preconditions.checkNotNull(userProvider.getCurrentUserName(),
"can not get current user name when security is enabled");
{code}
{{userProvider.getCurrentUserName()}} calls
{{UserGroupInformation.getCurrentUser()}}, which picks up whatever principal is
in the ticket cache if the keytab login has not yet happened. UGI then spawns a
TGT Renewer for that principal, even when it does not match the principal the
server is configured to use:
{noformat}
2026-04-24T10:27:07,173 DEBUG [TGT Renewer for [email protected]]
security.UserGroupInformation: Current time is 1776994027173, next refresh is
1777028575000
{noformat}
{noformat}
Thread 28 (TGT Renewer for [email protected]):
State: TIMED_WAITING
Blocked count: 0
Waited count: 1
Stack:
[email protected]/java.lang.Thread.sleep(Native Method)
app//org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:982)
[email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[email protected]/java.lang.Thread.run(Thread.java:829)
{noformat}
Because this happens at construction time, before any inbound connection, the
connection-side fix in this patch cannot prevent the stray TGT Renewer on 2.6+.
A different mitigation (for example, clearing the ticket cache or overriding
{{KRB5CCNAME}} before the daemon starts) would be needed there.
> NettyRpcServer may accept connections before start() completes
> --------------------------------------------------------------
>
> Key: HBASE-30101
> URL: https://issues.apache.org/jira/browse/HBASE-30101
> Project: HBase
> Issue Type: Bug
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2026-04-22-13-47-59-850.png
>
>
> h2. Description
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup
> (SASL secret manager, authorization manager, scheduler) only runs in
> {{{}start(){}}}.
> In the window between bind and start, netty workers can accept connections
> and run handler code that reaches into {{UserGroupInformation}} before the
> main thread has finished Kerberos login.
> h2. Observed symptoms
> We observed two issues due to this window during server restarts.
> h3. Error logs from handler code running before login completes
> {noformat}
> 026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error
> when trying to create instance of HBaseSaslRpcServer with sasl provider:
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal
> does NOT contain an instance (hostname): hbase
> {noformat}
> h3. TGT Renewer thread spawned, continuously emitting warning logs
> {noformat}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or
> directory
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
> at org.apache.hadoop.util.Shell.run(Shell.java:900)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
> at
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
> at
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {noformat}
> We use keytabs, so this {{TGT Renewer}} thread should never be created, but
> when the keytab login is delayed, it is created due to the timing issue.
> {noformat}
> 2026-04-20 16:31:07,446 INFO [main] ipc.NettyRpcServer: Bind to
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer:
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer:
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer:
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer:
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer:
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> 2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer:
> Connection /10.228.110.148:20935; # active connections=5
> 2026-04-20 16:31:07,583 INFO [main] security.UserGroupInformation: Login
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto
> renewal enabled : false
> 2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7]
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal
> name is ***/***
> {noformat}
> !image-2026-04-22-13-47-59-850.png|width=800!
> h2. Proposed fix
> Disable {{AUTO_READ}} on the server channel at bootstrap time so no
> connections are accepted until {{start()}} completes, then re-enable it at
> the end of {{{}start(){}}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)