Hi Li,

I got into a very similar situation to yours. The GC was taking much longer
than the configured ZooKeeper timeout. I had 3 nodes in the SolrCloud, and
very often my entire cluster would end up totally messed up. Increasing the
ZooKeeper timeout eventually helped. But before that, I was able to work
around it temporarily by running "rmr /solr/overseer/queue" in ZooKeeper
(I'm not sure whether I restarted Solr after that). I am not even sure this
is the right thing to do, but it seemed to unblock me at the time. At least
there was no negative effect.

Thanks,
Bosco
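For reference, a minimal sketch of that workaround, assuming ZooKeeper
3.4.x's bin/zkCli.sh and a /solr chroot as in Bosco's setup (the host:port
is a placeholder):

    bin/zkCli.sh -server zk1:2181
    # inside the zkCli shell:
    ls /solr/overseer/queue     # see how far the overseer queue has backed up
    rmr /solr/overseer/queue    # Bosco's workaround; he wasn't sure whether a Solr restart was also needed

The longer-term fix he describes is raising zkClientTimeout (in solr.xml, or
via the zkClientTimeout system property if your solr.xml references it)
above the worst GC pause you observe.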
On 4/29/16, 7:52 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> Well, there have been lots of improvements since 4.6. You're right,
> logically when things come back up and are all reachable, it seems like it
> should be theoretically possible to bring a node back up. There have been
> situations where that doesn't happen, and various fixes have been
> implemented as they're identified....
>
> You might try reloading the core from the core admin (that's about the
> only thing you should try in SolrCloud from the core admin screen)....
>
> Best,
> Erick
>
> On Wed, Apr 27, 2016 at 10:58 AM, Li Ding <li.d...@bloomreach.com> wrote:
>> Hi Erick,
>>
>> I don't have the GC log. But after the GC finished, shouldn't the zk ping
>> succeed and the core go back to its normal state? From the log I posted,
>> the sequence is:
>>
>> 1) Solr detects it can't connect to ZK and reconnects to ZK
>> 2) Solr marks all cores as down
>> 3) Solr recovers each core; some succeed, some fail
>> 4) After 30 minutes, the cores that failed are still marked as down
>>
>> So my question is: during that 30-minute interval, if GC took too long,
>> all cores should have failed. And GC doesn't take longer than a minute,
>> since all requests served by the other cores succeed, and the next zk
>> ping should bring the core back to normal, right? We have an active
>> monitor running at the same time querying every core in distrib=false
>> mode, and every query succeeds.
>>
>> Thanks,
>>
>> Li
>>
>> On Tue, Apr 26, 2016 at 6:20 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> One of the reasons this happens is if you have very long GC cycles,
>>> longer than the ZooKeeper "keep alive" timeout. During a full GC pause,
>>> Solr is unresponsive, and if the ZK ping times out, ZK assumes the
>>> machine is gone and you get into this recovery state.
>>>
>>> So I'd collect GC logs and see if you have any stop-the-world GC pauses
>>> that take longer than the ZK timeout.
>>>
>>> See Mark Miller's primer on GC here:
>>> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>>>
>>> Best,
>>> Erick
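For what it's worth, a sketch of collecting those GC logs on a JDK 7/8-era
Solr 4.x install. The flags are standard HotSpot options, but the log path
and the JAVA_OPTS variable are assumptions about how your start script
passes JVM options:

    JAVA_OPTS="$JAVA_OPTS -verbose:gc -Xloggc:/var/log/solr/gc.log \
      -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime"

    # then scan for stop-the-world pauses near or above zkClientTimeout:
    grep "Total time for which application threads were stopped" /var/log/solr/gc.log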
>>> On Tue, Apr 26, 2016 at 2:13 PM, Li Ding <li.d...@bloomreach.com> wrote:
>>>
>>>> Thank you all for your help!
>>>>
>>>> The zookeeper log rolled over; this is from solr.log.
>>>>
>>>> It looks like the Solr-ZK connection is gone for some reason:
>>>>
>>>> INFO - 2016-04-21 12:37:57.536; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@19789a96 name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent state:Disconnected type:None path:null path:null type:None
>>>> INFO - 2016-04-21 12:37:57.536; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
>>>> INFO - 2016-04-21 12:38:24.248; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one...
>>>> INFO - 2016-04-21 12:38:24.262; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper
>>>> INFO - 2016-04-21 12:38:24.269; org.apache.solr.common.cloud.ConnectionManager; Connected:true
>>>>
>>>> Then it publishes all cores on the host as down. I just list three cores here:
>>>>
>>>> INFO - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController; publishing core=product1_shard1_replica1 state=down
>>>> INFO - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController; publishing core=collection1 state=down
>>>> INFO - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController; numShards not found on descriptor - reading it from system property
>>>> INFO - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController; publishing core=product2_shard5_replica1 state=down
>>>> INFO - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController; publishing core=product2_shard13_replica1 state=down
>>>>
>>>> product1 has only one shard and one replica, and it was able to become active successfully:
>>>>
>>>> INFO - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController; Register replica - core:product1_shard1_replica1 address:http://{internalIp}:8983/solr collection:product1 shard:shard1
>>>> WARN - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove
>>>> INFO - 2016-04-21 12:38:26.393; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard1
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue.
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; http://{internalIp}:8983/solr/product1_shard1_replica1/ has no replicas
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1
>>>> INFO - 2016-04-21 12:38:26.399; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/product1/leaders/shard1
>>>> INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We are http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is http://{internalIp}:8983/solr/product1_shard1_replica1/
>>>> INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No LogReplay needed for core=product1_replica1 baseURL=http://{internalIp}:8983/solr
>>>> INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am the leader, no recovery necessary
>>>> INFO - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController; publishing core=product1_shard1_replica1 state=active
>>>>
>>>> product2 has 15 shards and one replica, but only two of its shards live on this machine. This is one of the failed shards; I never saw a message that the core product2_shard5_replica1 became active:
>>>>
>>>> INFO - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController; Register replica - product2_shard5_replica1 address:http://{internalIp}:8983/solr collection:product2 shard:shard5
>>>> WARN - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove
>>>> INFO - 2016-04-21 12:38:26.625; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard5
>>>> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue.
>>>> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
>>>> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
>>>> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
>>>> INFO - 2016-04-21 12:38:26.632; org.apache.solr.cloud.SyncStrategy; http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ has no replicas
>>>> INFO - 2016-04-21 12:38:26.632; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ shard5
>>>> INFO - 2016-04-21 12:38:26.632; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/product2_shard5_replica1/leaders/shard5
>>>> INFO - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We are http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ and leader is http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
>>>> INFO - 2016-04-21 12:38:26.646; org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from ZooKeeper...
>>>>
>>>> Before I restarted this server, a bunch of queries failed for the collection product2, but I don't think that would affect the core status.
>>>>
>>>> Do you guys have any idea why this particular core was never published as active? From the log, most steps completed except the very last one, publishing the state to ZK.
>>>>
>>>> Thanks,
>>>>
>>>> Li
>>>>
>>>> On Thu, Apr 21, 2016 at 7:08 AM, Rajesh Hazari <rajeshhaz...@gmail.com> wrote:
>>>>
>>>>> Hi Li,
>>>>>
>>>>> Do you see timeouts like "CLUSTERSTATUS the collection time out:180s"?
>>>>> If that's the case, this may be related to
>>>>> https://issues.apache.org/jira/browse/SOLR-7940,
>>>>> and I would say either apply the patch or upgrade.
>>>>>
>>>>> Thanks,
>>>>> Rajesh,
>>>>> 8328789519,
>>>>> If I don't answer your call please leave a voicemail with your contact info,
>>>>> will return your call ASAP.
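A quick way to check Solr's logs for the symptom Rajesh describes (the log
path is just a guess; adjust it for your install):

    grep -i "CLUSTERSTATUS" /path/to/solr/logs/solr.log | grep -i "time out"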
>>>>> On Thu, Apr 21, 2016 at 6:02 AM, YouPeng Yang <yypvsxf19870...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have used Solr 4.6 for 2 years. If you post more logs, maybe we
>>>>>> can fix it.
>>>>>>
>>>>>> 2016-04-21 6:50 GMT+08:00 Li Ding <li.d...@bloomreach.com>:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We are using SolrCloud 4.6.1 and have observed the following
>>>>>>> behavior recently. A Solr node in a SolrCloud cluster is up, but
>>>>>>> some of the cores on the node are marked as down in ZooKeeper. If
>>>>>>> the cores are part of a multi-sharded collection with one replica,
>>>>>>> queries to that collection will fail. However, when this happens,
>>>>>>> if we issue queries to the core directly, it returns 200 and the
>>>>>>> correct results. But once Solr gets into this state, the core stays
>>>>>>> marked as down forever unless we restart Solr.
>>>>>>>
>>>>>>> Has anyone seen this behavior before? Is there any way to get out
>>>>>>> of this state without a restart?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Li
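For anyone hitting the same thing: the two checks discussed in this thread
can be run from the command line. The host and core name below are
placeholders for your own setup:

    # query one core directly, bypassing the cluster state (what Li's monitor does)
    curl 'http://localhost:8983/solr/product2_shard5_replica1/select?q=*:*&rows=0&distrib=false'

    # Erick's suggestion: reload the stuck core via the CoreAdmin API
    # instead of restarting the whole node
    curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=product2_shard5_replica1'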