dang-stripe opened a new issue #7976:
URL: https://github.com/apache/pinot/issues/7976


   We've noticed that a broker can get stuck if it receives a SIGTERM while the 
broker resource is transitioning from OFFLINE to ONLINE. The broker then hangs 
indefinitely and ignores subsequent SIGTERMs, so we have to kill the process 
with SIGKILL to recover it. Will Pinot/Helix retry state transitions after 
errors like this?
   
   Here's a log we found before this happened:
   ```
   2022/01/06 00:15:47.311 ERROR [BrokerResourceOnlineOfflineStateModelFactory] [HelixTaskExecutor-message_handle_thread] Caught exception while processing transition from OFFLINE to ONLINE for table: test_table_REALTIME
   org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
           at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1202) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1328) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:320) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:390) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.helix.store.zk.AutoFallbackPropertyStore.get(AutoFallbackPropertyStore.java:101) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.pinot.common.metadata.ZKMetadataProvider.getTableConfig(ZKMetadataProvider.java:184) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.pinot.broker.routing.RoutingManager.buildRouting(RoutingManager.java:296) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
           at org.apache.pinot.broker.broker.helix.BrokerResourceOnlineOfflineStateModelFactory$BrokerResourceOnlineOfflineStateModel.onBecomeOnlineFromOffline(BrokerResourceOnlineOfflineStateModelFactory.java:80) [pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
   Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method) ~[?:?]
        at java.lang.Object.wait(Object.java:328) ~[?:?]
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2129) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2160) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.zookeeper.ZkConnection.readData(ZkConnection.java:136) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1340) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1190) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        ... 20 more
   ```
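
The trace shows a `java.lang.InterruptedException` from a blocking ZooKeeper read, wrapped in an unchecked `ZkInterruptedException` and caught inside the OFFLINE to ONLINE transition handler. The sketch below only illustrates that general pattern in plain Java: a blocking call is interrupted, the interrupt is wrapped in a runtime exception, and the catching code walks the cause chain and restores the thread's interrupt flag. Every name in it (`InterruptDemo`, `WrappedInterruptedException`, `blockingRead`, `runTransition`) is hypothetical. It is not Pinot or Helix code, and it does not establish that this is why the broker hangs.

```java
import java.util.concurrent.CountDownLatch;

class InterruptDemo {
    // Hypothetical stand-in for an unchecked wrapper such as ZkInterruptedException.
    static class WrappedInterruptedException extends RuntimeException {
        WrappedInterruptedException(InterruptedException cause) {
            super(cause);
        }
    }

    // Simulates a blocking read: the latch is never counted down, so await()
    // only returns by throwing when the thread is interrupted.
    static void blockingRead(CountDownLatch neverReleased) {
        try {
            neverReleased.await();
        } catch (InterruptedException e) {
            // await() clears the interrupt flag before throwing.
            throw new WrappedInterruptedException(e);
        }
    }

    // Walks the cause chain looking for an InterruptedException.
    static boolean causedByInterrupt(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical transition handler. Returns true if the work was cut short
    // by an interrupt, in which case the interrupt flag is restored so that
    // code further up the stack can still observe it.
    static boolean runTransition() {
        try {
            blockingRead(new CountDownLatch(1));
            return false;
        } catch (RuntimeException e) {
            if (causedByInterrupt(e)) {
                Thread.currentThread().interrupt();
                return true;
            }
            throw e;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = new boolean[2];
        Thread worker = new Thread(() -> {
            result[0] = runTransition();
            result[1] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        worker.interrupt(); // plays the role of the shutdown-time interrupt
        worker.join();
        System.out.println("interrupt seen=" + result[0] + ", flag restored=" + result[1]);
    }
}
```

If a handler catches the wrapper and only logs it, the interrupt flag stays cleared, so later blocking calls on that thread will not see the interrupt. Whether that matters for the hang described here is an open question.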


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


