dang-stripe opened a new issue #7976:
URL: https://github.com/apache/pinot/issues/7976
We've noticed that a broker can get stuck if it receives a SIGTERM while the
broker resource is transitioning from OFFLINE to ONLINE. The broker then
appears to stay stuck indefinitely and ignores subsequent SIGTERMs, so we have
to kill the process with SIGKILL to recover it. Will Pinot/Helix retry state
transitions that fail with errors like this?
Here's a log we found before this happened:
```
2022/01/06 00:15:47.311 ERROR [BrokerResourceOnlineOfflineStateModelFactory] [HelixTaskExecutor-message_handle_thread] Caught exception while processing transition from OFFLINE to ONLINE for table: test_table_REALTIME
org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1202) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1328) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:320) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:390) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.store.zk.AutoFallbackPropertyStore.get(AutoFallbackPropertyStore.java:101) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.pinot.common.metadata.ZKMetadataProvider.getTableConfig(ZKMetadataProvider.java:184) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.pinot.broker.routing.RoutingManager.buildRouting(RoutingManager.java:296) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.pinot.broker.broker.helix.BrokerResourceOnlineOfflineStateModelFactory$BrokerResourceOnlineOfflineStateModel.onBecomeOnlineFromOffline(BrokerResourceOnlineOfflineStateModelFactory.java:80) [pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method) ~[?:?]
    at java.lang.Object.wait(Object.java:328) ~[?:?]
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2129) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2160) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkConnection.readData(ZkConnection.java:136) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1340) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1190) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    ... 20 more
```
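To illustrate the general mechanism visible in the trace (a thread blocked in
`Object.wait()` is interrupted, and the `InterruptedException` is wrapped in an
unchecked exception that the transition handler catches and logs), here is a
minimal, self-contained sketch. It is not Pinot or Helix code: every class and
method name in it is hypothetical, and it makes no claim about why the broker
hangs afterwards. It only shows that once the exception is caught, the thread's
interrupt flag is cleared unless the handler restores it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptedTransitionSketch {

  /** Hypothetical stand-in for an unchecked wrapper such as ZkInterruptedException. */
  static class WrappedInterruptException extends RuntimeException {
    WrappedInterruptException(InterruptedException cause) {
      super(cause);
    }
  }

  /** Hypothetical blocking read: waits on a monitor until the thread is interrupted. */
  static void blockingRead(Object lock, CountDownLatch started) {
    synchronized (lock) {
      started.countDown();
      try {
        while (true) {
          lock.wait(); // throws InterruptedException and clears the interrupt flag
        }
      } catch (InterruptedException e) {
        throw new WrappedInterruptException(e);
      }
    }
  }

  /**
   * Runs a simulated transition on a worker thread, interrupts it, and returns
   * whether the worker's interrupt flag was still set after the handler ran.
   */
  static boolean interruptFlagAfterHandling(boolean restoreFlag) throws InterruptedException {
    Object lock = new Object();
    CountDownLatch started = new CountDownLatch(1);
    AtomicBoolean flagSeen = new AtomicBoolean();

    Thread worker = new Thread(() -> {
      try {
        blockingRead(lock, started);
      } catch (WrappedInterruptException e) {
        // A handler that only logs would stop here; the interrupt is then lost.
        if (restoreFlag) {
          Thread.currentThread().interrupt();
        }
      }
      flagSeen.set(Thread.currentThread().isInterrupted());
    }, "transition-worker");

    worker.start();
    started.await();    // wait until the worker is at (or about to enter) wait()
    worker.interrupt(); // simulates the shutdown path interrupting the worker
    worker.join();
    return flagSeen.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("flag without restore: " + interruptFlagAfterHandling(false)); // false
    System.out.println("flag with restore: " + interruptFlagAfterHandling(true));     // true
  }
}
```

Whether the real transition handler restores the flag, and whether that has
anything to do with the hang, would need to be checked against the actual
Pinot and Helix sources.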