[ 
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junegunn Choi updated HBASE-30101:
----------------------------------
    Description: 
h2. Problem

The {{RpcServer}} can access {{UserGroupInformation}} before Kerberos login 
completes during startup. When this happens, UGI bootstraps from the ticket 
cache and spawns a stray {{TGT Renewer}} thread for whichever principal happens 
to be there, even when it does not match the principal the server is configured 
to use.

Two independent code paths trigger this:

h3. 1. NettyRpcServer accepts connections before start()

{{NettyRpcServer}} binds the server socket in its constructor, but auth setup 
(SASL secret manager, authorization manager, scheduler) only runs in 
{{start()}}. In that window, netty workers can accept connections and run 
handler code that reaches into UGI before the main thread has finished Kerberos 
login.

Fixed in [PR #8110|https://github.com/apache/hbase/pull/8110]: disable 
{{AUTO_READ}} on the server channel at bootstrap, re-enable at the end of 
{{start()}}.

h3. 2. RpcServer constructor reads UGI (HBase 2.6+)

Since [HBASE-28321|https://issues.apache.org/jira/browse/HBASE-28321] (PR 
[#5688|https://github.com/apache/hbase/pull/5688]), the {{RpcServer}} 
constructor calls:

{code:java}
serverPrincipal = Preconditions.checkNotNull(userProvider.getCurrentUserName(),
  "can not get current user name when security is enabled");
{code}

{{userProvider.getCurrentUserName()}} calls 
{{UserGroupInformation.getCurrentUser()}}, which picks up whatever principal is 
in the ticket cache if the keytab login has not yet happened. Because this 
happens at construction time, before any inbound connection, the 
connection-side fix in path 1 cannot prevent the stray {{TGT Renewer}} here.

Fixed in [PR #8122|https://github.com/apache/hbase/pull/8122]: resolve the 
hostname up front via {{DNS.getHostname}} and run ZK client and server logins 
before {{createRpcServices()}}.

h2. Observed symptoms

h3. HBase 2.4 (path 1)

Handler code error:

{code}
2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is hbase
2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is hbase
2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error when 
trying to create instance of HBaseSaslRpcServer with sasl provider: 
org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal does 
NOT contain an instance (hostname): hbase
{code}

Stray {{TGT Renewer}} thread, continuously emitting:

{code}
java.io.IOException: Cannot run program "kinit": error=2, No such file or 
directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
    at org.apache.hadoop.util.Shell.run(Shell.java:900)
    at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
    at 
org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
    at 
org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
{code}

Connections accepted before keytab login completes:

{code}
2026-04-20 16:31:07,446 INFO  [main] ipc.NettyRpcServer: Bind to 
/10.228.198.67:11471
2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: 
Connection /10.227.143.27:31898; # active connections=0
2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: 
Connection /10.192.163.175:2080; # active connections=1
2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: 
Connection /10.227.225.174:25343; # active connections=2
2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: 
Connection /10.192.163.189:32775; # active connections=3
2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: 
Connection /10.192.163.203:43663; # active connections=4
2026-04-20 16:31:07,583 INFO  [main] security.UserGroupInformation: Login 
successful for user ***/*** using keytab file hbase.keytab. Keytab auto renewal 
enabled : false
{code}

h3. HBase 2.6+ (path 2)

Stray {{TGT Renewer}} for a principal that does not match the server's 
configured principal:

{noformat}
2026-04-24T10:27:07,173 DEBUG [TGT Renewer for [email protected]] 
security.UserGroupInformation: Current time is 1776994027173, next refresh is 
1777028575000
{noformat}

Thread dump showing the renewer spawned from the {{RpcServer}} constructor 
(before {{login()}}):

{noformat}
Thread 28 (TGT Renewer for [email protected]):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    [email protected]/java.lang.Thread.sleep(Native Method)
    
app//org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:982)
    
[email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    [email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    [email protected]/java.lang.Thread.run(Thread.java:829)
{noformat}

  was:
h2. Description

{{NettyRpcServer}} binds the server socket in its constructor, but auth setup 
(SASL secret manager, authorization manager, scheduler) only runs in 
{{{}start(){}}}.

In the window between bind and start, netty workers can accept connections and 
run handler code that reaches into {{UserGroupInformation}} before the main 
thread has finished Kerberos login.
h2. Observed symptoms

We observed two issues due to this window during server restarts.
h3. Error logs from handler code running before login completes
{noformat}
026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is hbase
2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is hbase
2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error when 
trying to create instance of HBaseSaslRpcServer with sasl provider: 
org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal does 
NOT contain an instance (hostname): hbase
{noformat}
h3. TGT Renewer thread spawned, continuously emitting warning logs
{noformat}
java.io.IOException: Cannot run program "kinit": error=2, No such file or 
directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
    at org.apache.hadoop.util.Shell.run(Shell.java:900)
    at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
    at 
org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
    at 
org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
{noformat}
We use keytabs, so this {{TGT Renewer}} thread should never be created, but 
when the keytab login is delayed, it is created due to the timing issue.
{noformat}
2026-04-20 16:31:07,446 INFO  [main] ipc.NettyRpcServer: Bind to 
/10.228.198.67:11471
2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: 
Connection /10.227.143.27:31898; # active connections=0
2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: 
Connection /10.192.163.175:2080; # active connections=1
2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: 
Connection /10.227.225.174:25343; # active connections=2
2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: 
Connection /10.192.163.189:32775; # active connections=3
2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: 
Connection /10.192.163.203:43663; # active connections=4
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is ***/***
2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is ***/***
2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is ***/***
2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer: 
Connection /10.228.110.148:20935; # active connections=5
2026-04-20 16:31:07,583 INFO  [main] security.UserGroupInformation: Login 
successful for user ***/*** using keytab file hbase.keytab. Keytab auto renewal 
enabled : false
2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7] 
provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name 
is ***/***
{noformat}

 !image-2026-04-22-13-47-59-850.png|width=800! 

h2. Proposed fix

Disable {{AUTO_READ}} on the server channel at bootstrap time so no connections 
are accepted until {{start()}} completes, then re-enable it at the end of 
{{{}start(){}}}.


> Stray TGT Renewer from RpcServer accessing UGI before Kerberos login
> --------------------------------------------------------------------
>
>                 Key: HBASE-30101
>                 URL: https://issues.apache.org/jira/browse/HBASE-30101
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: image-2026-04-22-13-47-59-850.png
>
>
> h2. Problem
> The {{RpcServer}} can access {{UserGroupInformation}} before Kerberos login 
> completes during startup. When this happens, UGI bootstraps from the ticket 
> cache and spawns a stray {{TGT Renewer}} thread for whichever principal 
> happens to be there, even when it does not match the principal the server is 
> configured to use.
> Two independent code paths trigger this:
> h3. 1. NettyRpcServer accepts connections before start()
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup 
> (SASL secret manager, authorization manager, scheduler) only runs in 
> {{start()}}. In that window, netty workers can accept connections and run 
> handler code that reaches into UGI before the main thread has finished 
> Kerberos login.
> Fixed in [PR #8110|https://github.com/apache/hbase/pull/8110]: disable 
> {{AUTO_READ}} on the server channel at bootstrap, re-enable at the end of 
> {{start()}}.
> h3. 2. RpcServer constructor reads UGI (HBase 2.6+)
> Since [HBASE-28321|https://issues.apache.org/jira/browse/HBASE-28321] (PR 
> [#5688|https://github.com/apache/hbase/pull/5688]), the {{RpcServer}} 
> constructor calls:
> {code:java}
> serverPrincipal = 
> Preconditions.checkNotNull(userProvider.getCurrentUserName(),
>   "can not get current user name when security is enabled");
> {code}
> {{userProvider.getCurrentUserName()}} calls 
> {{UserGroupInformation.getCurrentUser()}}, which picks up whatever principal 
> is in the ticket cache if the keytab login has not yet happened. Because this 
> happens at construction time, before any inbound connection, the 
> connection-side fix in path 1 cannot prevent the stray {{TGT Renewer}} here.
> Fixed in [PR #8122|https://github.com/apache/hbase/pull/8122]: resolve the 
> hostname up front via {{DNS.getHostname}} and run ZK client and server logins 
> before {{createRpcServices()}}.
> h2. Observed symptoms
> h3. HBase 2.4 (path 1)
> Handler code error:
> {code}
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error 
> when trying to create instance of HBaseSaslRpcServer with sasl provider: 
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal 
> does NOT contain an instance (hostname): hbase
> {code}
> Stray {{TGT Renewer}} thread, continuously emitting:
> {code}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or 
> directory
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
>     at org.apache.hadoop.util.Shell.run(Shell.java:900)
>     at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
>     at 
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
>     at 
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
> {code}
> Connections accepted before keytab login completes:
> {code}
> 2026-04-20 16:31:07,446 INFO  [main] ipc.NettyRpcServer: Bind to 
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: 
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: 
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: 
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: 
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: 
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,583 INFO  [main] security.UserGroupInformation: Login 
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto 
> renewal enabled : false
> {code}
> h3. HBase 2.6+ (path 2)
> Stray {{TGT Renewer}} for a principal that does not match the server's 
> configured principal:
> {noformat}
> 2026-04-24T10:27:07,173 DEBUG [TGT Renewer for [email protected]] 
> security.UserGroupInformation: Current time is 1776994027173, next refresh is 
> 1777028575000
> {noformat}
> Thread dump showing the renewer spawned from the {{RpcServer}} constructor 
> (before {{login()}}):
> {noformat}
> Thread 28 (TGT Renewer for [email protected]):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 1
>   Stack:
>     [email protected]/java.lang.Thread.sleep(Native Method)
>     
> app//org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:982)
>     
> [email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     [email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     
> [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     
> [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     [email protected]/java.lang.Thread.run(Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to