[ 
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18075865#comment-18075865
 ] 

Junegunn Choi commented on HBASE-30101:
---------------------------------------

h2. Note on HBase 2.6+

I ran into these problems on HBase 2.4, and the patch is crafted to target that 
case: it closes the race between {{bind()}} and {{start()}} during which netty 
workers can handle incoming connections and touch {{UserGroupInformation}} 
before the main thread has logged in.

On HBase 2.6+, the situation is a little different. 
[HBASE-28321|https://issues.apache.org/jira/browse/HBASE-28321] (PR 
[#5688|https://github.com/apache/hbase/pull/5688]) added this line to the 
{{RpcServer}} constructor:

{code:java}
serverPrincipal = Preconditions.checkNotNull(userProvider.getCurrentUserName(),
  "can not get current user name when security is enabled");
{code}

{{userProvider.getCurrentUserName()}} calls 
{{UserGroupInformation.getCurrentUser()}}, which picks up whatever principal is 
in the ticket cache if the keytab login has not yet happened. UGI then spawns a 
TGT Renewer for that principal, even when it does not match the principal the 
server is configured to use:

{noformat}
2026-04-24T10:27:07,173 DEBUG [TGT Renewer for [email protected]] 
security.UserGroupInformation: Current time is 1776994027173, next refresh is 
1777028575000
{noformat}

{noformat}
Thread 28 (TGT Renewer for [email protected]):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    [email protected]/java.lang.Thread.sleep(Native Method)
    
app//org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:982)
    
[email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    [email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    [email protected]/java.lang.Thread.run(Thread.java:829)
{noformat}

Because this happens at construction time, before any inbound connection, the 
connection-side fix in this patch cannot prevent the stray TGT Renewer on 2.6+. 
A different mitigation (for example, clearing the ticket cache or overriding 
{{KRB5CCNAME}} before the daemon starts) would be needed there.


> NettyRpcServer may accept connections before start() completes
> --------------------------------------------------------------
>
>                 Key: HBASE-30101
>                 URL: https://issues.apache.org/jira/browse/HBASE-30101
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: image-2026-04-22-13-47-59-850.png
>
>
> h2. Description
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup 
> (SASL secret manager, authorization manager, scheduler) only runs in 
> {{{}start(){}}}.
> In the window between bind and start, netty workers can accept connections 
> and run handler code that reaches into {{UserGroupInformation}} before the 
> main thread has finished Kerberos login.
> h2. Observed symptoms
> We observed two issues due to this window during server restarts.
> h3. Error logs from handler code running before login completes
> {noformat}
> 026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error 
> when trying to create instance of HBaseSaslRpcServer with sasl provider: 
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal 
> does NOT contain an instance (hostname): hbase
> {noformat}
> h3. TGT Renewer thread spawned, continuously emitting warning logs
> {noformat}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or 
> directory
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
>     at org.apache.hadoop.util.Shell.run(Shell.java:900)
>     at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
>     at 
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
>     at 
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> {noformat}
> We use keytabs, so this {{TGT Renewer}} thread should never be created, but 
> when the keytab login is delayed, it is created due to the timing issue.
> {noformat}
> 2026-04-20 16:31:07,446 INFO  [main] ipc.NettyRpcServer: Bind to 
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: 
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: 
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: 
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: 
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: 
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-6] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-4] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-2] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,557 DEBUG [RS-EventLoopGroup-1-3] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,559 DEBUG [RS-EventLoopGroup-1-5] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> 2026-04-20 16:31:07,563 TRACE [RS-EventLoopGroup-1-7] ipc.NettyRpcServer: 
> Connection /10.228.110.148:20935; # active connections=5
> 2026-04-20 16:31:07,583 INFO  [main] security.UserGroupInformation: Login 
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto 
> renewal enabled : false
> 2026-04-20 16:31:07,587 DEBUG [RS-EventLoopGroup-1-7] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is ***/***
> {noformat}
>  !image-2026-04-22-13-47-59-850.png|width=800! 
> h2. Proposed fix
> Disable {{AUTO_READ}} on the server channel at bootstrap time so no 
> connections are accepted until {{start()}} completes, then re-enable it at 
> the end of {{{}start(){}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to