I think those followers are red from trying to forward requests to the overseer while it was being restarted. I guess I'll see if they turn green over time, or I can restart them one at a time.
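If it helps, one way to see the state ZK actually has recorded for each replica (rather than eyeballing the cloud graph) is something like the untested sketch below. It assumes the Python kazoo client and the 4.x /clusterstate.json layout; the ZK host is a placeholder.

# Rough sketch only: read replica states straight from ZK instead of the cloud graph.
# Assumes the Python kazoo client and the 4.x /clusterstate.json layout; the ZK
# host below is a placeholder.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")  # placeholder
zk.start()

data, _stat = zk.get("/clusterstate.json")
clusterstate = json.loads(data.decode("utf-8"))

for collection, coll in clusterstate.items():
    for shard, shard_info in coll.get("shards", {}).items():
        for replica, info in shard_info.get("replicas", {}).items():
            print(collection, shard, replica, info.get("state"), info.get("base_url"))

zk.stop()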
On Fri, Mar 15, 2013 at 6:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:

> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
> are red now in the solr cloud graph... trying to figure out what that
> means...
>
>
> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>
>> I restarted the overseer node and another took over, queues are empty now.
>>
>> The server with core production_things_shard1_2 is having these errors:
>>
>> shard update error RetryNode:
>> http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException:
>> Server refused connection at:
>> http://10.104.59.189:8883/solr/production_things_shard11_replica1
>>
>> for shard11!!!
>>
>> I also got some strange errors on the restarted node. Makes me wonder if
>> there is a string-matching bug for shard1 vs shard11?
>>
>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
>>   at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
>>   at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
>>   at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.solr.common.SolrException: There is conflicting
>> information about the leader of shard: shard1 our state says:
>> http://10.104.59.189:8883/solr/collection1/ but zookeeper says:
>> http://10.217.55.151:8883/solr/collection1/
>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
>>
>> INFO: Releasing
>> directory:/vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
>>   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
>>   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
>>
>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
>> state recovering for 10.76.31.67:8883_solr but I still do not see the
>> requested state. I see state: active live:true
>>   at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
>>
>>
>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
>>> here.
>>>
>>> Can you do a stack dump on the overseer and see if you see an Overseer
>>> thread running perhaps? Or just post the results?
>>>
>>> To recover, you should be able to just restart the Overseer node and
>>> have someone else take over - they should pick up processing the queue.
>>>
>>> Any logs you might be able to share could be useful too.
>>>
>>> - Mark
>>>
>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>
>>>> Also, looking at overseer_elect, everything looks fine. The node is
>>>> valid and live.
>>>>
>>>>
>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>
>>>>> Sorry, should have specified: 4.1.
>>>>>
>>>>>
>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>>> What Solr version? 4.0, 4.1, 4.2?
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>>
>>>>>>> My solr cloud has been running fine for weeks, but about a week ago
>>>>>>> it stopped dequeueing from the overseer queue, and now there are
>>>>>>> thousands of tasks on the queue, most of which look like:
>>>>>>>
>>>>>>> {
>>>>>>>   "operation":"state",
>>>>>>>   "numShards":null,
>>>>>>>   "shard":"shard3",
>>>>>>>   "roles":null,
>>>>>>>   "state":"recovering",
>>>>>>>   "core":"production_things_shard3_2",
>>>>>>>   "collection":"production_things",
>>>>>>>   "node_name":"10.31.41.59:8883_solr",
>>>>>>>   "base_url":"http://10.31.41.59:8883/solr"}
>>>>>>>
>>>>>>> I'm trying to create a new collection through the collection API,
>>>>>>> and obviously nothing is happening...
>>>>>>>
>>>>>>> Any suggestion on how to fix this? Drop the queue in zk?
>>>>>>>
>>>>>>> How could it have gotten into this state in the first place?
>>>>>>>
>>>>>>> thanks,
>>>>>>> gary
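Re: the "drop the queue in zk?" question further up the thread: restarting the overseer so another node picked up processing was enough here, but for anyone who does end up inspecting or clearing the queue by hand, the sketch below is roughly what that could look like. It is untested, assumes the queue lives at /overseer/queue, and uses the Python kazoo client; the ZK host is a placeholder.

# Rough sketch only: peek at (and optionally clear) the overseer queue in ZK.
# Assumes the queue lives at /overseer/queue and uses the Python kazoo client;
# the ZK host below is a placeholder. Restarting the overseer is the safer fix.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")  # placeholder
zk.start()

QUEUE = "/overseer/queue"
items = sorted(zk.get_children(QUEUE))
print("queued overseer tasks:", len(items))

# Print the first few tasks to confirm they are the stale "state" operations.
for name in items[:5]:
    data, _stat = zk.get(QUEUE + "/" + name)
    print(name, json.loads(data.decode("utf-8")))

# Last resort only -- drop every queued task:
# for name in items:
#     zk.delete(QUEUE + "/" + name)

zk.stop()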