The biggest win I've seen for the stability of Hadoop components is to give them their own hard disks, or, better yet, their own hosts.
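For ZooKeeper in particular, the latency-sensitive piece is the transaction log, so moving dataLogDir onto its own spindle is usually the first step. A minimal zoo.cfg sketch — the paths here are illustrative, adjust to your own mount points:

```properties
# zoo.cfg -- paths are examples, not recommendations
tickTime=2000
initLimit=10
syncLimit=5
# snapshots can tolerate sharing a disk with other services
dataDir=/data1/zookeeper
# transaction log on a dedicated disk, isolated from HDFS/YARN I/O
dataLogDir=/data2/zookeeper-txlog
clientPort=2181
```

Since your ZK and RM are co-located, any heavy I/O from the RM (or anything else on that box) can stall ZK's fsyncs and trigger exactly the session timeouts you're seeing below.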
Obviously, you'll also want to check the usual suspects of resource and processor contention.

On Wed, May 4, 2016 at 3:59 PM, Anandha L Ranganathan <[email protected]> wrote:

> The RM keeps going down and here is the error message we are getting.
> How do we fix the issue?
>
> ZK and RM are on the same host.
>
> 2016-05-04 19:17:36,132 INFO resourcemanager.RMAppManager (RMAppManager.java:checkAppNumCompletedLimit(247)) - Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1452798563961_0972 from state store.
> 2016-05-04 19:17:42,751 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1096)) - Client session timed out, have not heard from server in 6668ms for sessionid 0x5547d33e8480000, closing socket connection and attempting reconnect
> 2016-05-04 19:17:42,851 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1110)) - Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:937)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:934)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1076)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1097)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:934)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:948)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:965)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationStateInternal(ZKRMStateStore.java:655)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:163)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:148)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:810)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:864)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:859)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-05-04 19:17:42,851 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1112)) - Retrying operation on ZK. Retry no. 1
> 2016-05-04 19:17:42,964 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server ip-10-0-83-40.us-west-2.compute.internal/10.0.83.40:2181. Will not attempt to authenticate using SASL (unknown error)
> 2016-05-04 19:17:42,965 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to ip-10-0-83-40.us-west-2.compute.internal/10.0.83.40:2181, initiating session
> 2016-05-04 19:17:42,969 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server ip-10-0-83-40.us-west-2.compute.internal/10.0.83.40:2181, sessionid = 0x5547d33e8480000, negotiated timeout = 10000
> 2016-05-04 19:17:42,991 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x5547d33e8480000 for server ip-10-0-83-40.us-west-2.compute.internal/10.0.83.40:2181, unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
>         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
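One more note on the log itself: the client reports not hearing from the server for 6668ms against a negotiated session timeout of 10000ms, so ZK is stalling for several seconds at a time — consistent with I/O contention on a shared host. While you separate the disks/hosts, giving the RM's state store more headroom can reduce the flapping. A sketch of the relevant yarn-site.xml knobs (the values are examples, not recommendations, and the effective session timeout is still capped by the ZK server's maxSessionTimeout):

```xml
<!-- yarn-site.xml: example values only -->
<property>
  <!-- ZK session timeout used by the RM state store (default 10000 ms) -->
  <name>yarn.resourcemanager.zk-timeout-ms</name>
  <value>30000</value>
</property>
<property>
  <!-- how many times ZK operations are retried before the RM gives up -->
  <name>yarn.resourcemanager.zk-num-retries</name>
  <value>1000</value>
</property>
```

Treat this as a stopgap: a longer timeout masks the stalls, it doesn't remove them.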
