Hmmmm. What this usually means is that the session between the Solr instance and ZooKeeper somehow timed out. The first thing I'd look at is the GC logs, on both the Solr instances and the ZooKeeper instances. If you see stop-the-world pauses approaching your zkClientTimeout (15 seconds with your settings), that's where I'd start.
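
For the GC check, here is a minimal sketch in Python. It assumes JDK 8 style GC logs written with -XX:+PrintGCApplicationStoppedTime (one of the options in the quoted config), which log a "Total time for which application threads were stopped" line per pause. The function name find_long_pauses and the 15-second default threshold are my own placeholders, not anything Solr ships; adjust the threshold to your zkClientTimeout.

```python
import re
import sys

# Line format printed by JDK 8 with -XX:+PrintGCApplicationStoppedTime, e.g.:
# 2020-04-03T15:40:01.123+0000: 1234.567: Total time for which application
# threads were stopped: 16.2501234 seconds, Stopping threads took: ...
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:[.,]\d+)?) seconds"
)


def find_long_pauses(lines, threshold_s=15.0):
    """Return (timestamp_prefix, seconds) for every pause >= threshold_s.

    threshold_s is a hypothetical default chosen to match a
    zkClientTimeout of 15000 ms; it is not a Solr or JVM setting.
    """
    hits = []
    for line in lines:
        match = STOPPED.search(line)
        if match is None:
            continue
        # Some locales print a decimal comma instead of a point.
        seconds = float(match.group(1).replace(",", "."))
        if seconds >= threshold_s:
            # Everything before the message is the date/uptime stamp.
            stamp = line[:match.start()].strip().rstrip(":")
            hits.append((stamp, seconds))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gc_pauses.py /path/to/solr_gc.log [threshold_seconds]
    threshold = float(sys.argv[2]) if len(sys.argv) > 2 else 15.0
    with open(sys.argv[1], errors="replace") as log:
        for stamp, seconds in find_long_pauses(log, threshold):
            print(f"{stamp}  stopped for {seconds:.3f} s")
```

Run it against the GC logs on each Solr node (and on the ZooKeeper nodes if they log in the same format), and compare the timestamps it prints with the time of the "Session expired" errors. A lower threshold, say a few seconds, can help surface near-misses.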

But I've also seen these errors show up when GC wasn't the cause, and I've never quite known where to start tracking them down; it becomes a lot of detective work.

Oh, and be sure to look in three places:
- the ZooKeeper logs (besides GC)
- the Solr log on the leader of the shard with the replica that fails to recover
- the Solr log on the node that's failing to recover

Best,
Erick

> On Apr 3, 2020, at 11:52 AM, Robbie Douglas <rld...@cornell.edu> wrote:
>
> Hello,
>
> We had an outage on one of our Solr nodes that we are trying to figure out.
> Here's what came up in the Solr admin logs. 3 separate ones that I think
> were in this order, but maybe not.
>
> Stopping recovery for core=[b1_shard5_replica_n16] coreNodeName=[core_node19]
>
> Error while trying to recover. core=b1_shard5_replica_n16:org.apache.solr.common.SolrException: Error while saving shard term for collection: b1
>     at org.apache.solr.cloud.ZkShardTerms.saveTerms(ZkShardTerms.java:307)
>     at org.apache.solr.cloud.ZkShardTerms.forceSaveTerms(ZkShardTerms.java:281)
>     at org.apache.solr.cloud.ZkShardTerms.startRecovering(ZkShardTerms.java:227)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1576)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1500)
>     at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:577)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:326)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:307)
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/b1/terms/shard5
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1336)
>     at org.apache.solr.common.cloud.SolrZkClient.lambda$setData$6(SolrZkClient.java:370)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
>     at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:370)
>     at org.apache.solr.cloud.ZkShardTerms.saveTerms(ZkShardTerms.java:297)
>     ... 14 more
>
> Could not publish that recovery failed:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1111)
>     at org.apache.solr.common.cloud.SolrZkClient.lambda$exists$2(SolrZkClient.java:322)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
>     at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:322)
>     at org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:309)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1587)
>     at org.apache.solr.cloud.ZkController.publish(ZkController.java:1500)
>     at org.apache.solr.cloud.RecoveryStrategy.recoveryFailed(RecoveryStrategy.java:190)
>     at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:715)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:326)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:307)
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
>
> Solr is 8.1.1 with Zookeeper 3.4.9 deployed on the same nodes.
>
> Solr config looks like this.
>
> -DSTOP.KEY=solrrocks
> -DSTOP.PORT=7983
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=18983
> -Dcom.sun.management.jmxremote.rmi.port=18983
> -Dcom.sun.management.jmxremote.ssl=false
> -Djetty.home=/cul/app/solr/solr/server
> -Djetty.port=8983
> -Dlog4j.configurationFile=file:/cul/data/solr/log4j2.xml
> -Dsolr.data.home=
> -Dsolr.default.confdir=/cul/app/solr/solr/server/solr/configsets/_default/conf
> -Dsolr.install.dir=/cul/app/solr/solr
> -Dsolr.jetty.https.port=8983
> -Dsolr.log.dir=/cul/data/solr/logs
> -Dsolr.log.muteconsole
> -Dsolr.solr.home=/cul/data/solr/data
> -Duser.timezone=UTC
> -DzkClientTimeout=15000
> -DzkHost=zk-host1:2181, zk-host2:2181, zk-host3:2181
> -XX:+AlwaysPreTouch
> -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseG1GC
> -XX:+UseGCLogFileRotation
> -XX:+UseLargePages
> -XX:GCLogFileSize=20M
> -XX:MaxGCPauseMillis=250
> -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/cul/app/solr/solr/bin/oom_solr.sh 8983 /cul/data/solr/logs
> -Xloggc:/cul/data/solr/logs/solr_gc.log
> -Xms8g
> -Xmx8g
> -Xss256k
> -verbose:gc
>
>
> Any ideas on what to keep an eye on that would cause this would be greatly
> appreciated.
>
> Thanks,
> Robbie
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html