I will upgrade to 4.2 this weekend and see what happens. We are on EC2 and have had a few issues with hostnames with both zk and solr (though in this case I haven't rebooted any instances either).

The upgrade is a relative pain because we have a query/scorer fork of lucene along with supplemental jars, and zk cannot distribute binary jars via the config. We are also multi-collection per zk... I wish the core admin didn't require a core to be defined up front; I would love to have an instance with no cores and then just create the cores I need.
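Something like this is what I have in mind (an untested sketch against the CoreAdmin HTTP API as I remember it in 4.x; the host, core name, and config name below are all made up):

import requests  # assumes the requests library is available

SOLR = "http://10.0.0.1:8883/solr"  # hypothetical node started with no cores

# Ask the node to create a core for the collection, with the config pulled
# from zk rather than a locally provisioned directory.
params = {
    "action": "CREATE",
    "name": "production_things_shard1_replica2",        # made-up core name
    "collection": "production_things",
    "shard": "shard1",
    "collection.configName": "production_things_conf",  # config already uploaded to zk
}
resp = requests.get(SOLR + "/admin/cores", params=params)
print(resp.status_code, resp.text)

That's really all I want an empty node to be able to do out of the box.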
-g

On Fri, Mar 15, 2013 at 7:14 PM, Mark Miller <markrmil...@gmail.com> wrote:

> On Mar 15, 2013, at 10:04 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>
> > i think those followers are red from trying to forward requests to the
> > overseer while it was being restarted. i guess i'll see if they become
> > green over time. or i guess i can restart them one at a time..
>
> Restarting the cluster clears things up. It shouldn't take too long for
> those nodes to recover though - they should have been up to date before.
> The couple of exceptions you posted def indicate something is out of whack.
> It's something I'd like to get to the bottom of.
>
> - Mark
>
> > On Fri, Mar 15, 2013 at 6:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >
> >> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
> >> are red now in the solr cloud graph.. trying to figure out what that
> >> means...
> >>
> >> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>
> >>> I restarted the overseer node and another took over; queues are empty now.
> >>>
> >>> The server with core production_things_shard1_2 is having these errors:
> >>>
> >>> shard update error RetryNode:
> >>> http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException:
> >>> Server refused connection at:
> >>> http://10.104.59.189:8883/solr/production_things_shard11_replica1
> >>>
> >>> for shard11!!!
> >>>
> >>> I also got some strange errors on the restarted node. Makes me wonder if
> >>> there is a string-matching bug for shard1 vs shard11?
> >>>
> >>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
> >>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
> >>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
> >>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
> >>>   at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
> >>>   at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
> >>>   at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
> >>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
> >>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
> >>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>   at java.lang.Thread.run(Thread.java:722)
> >>> Caused by: org.apache.solr.common.SolrException: There is conflicting information
> >>> about the leader of shard: shard1 our state says:
> >>> http://10.104.59.189:8883/solr/collection1/ but zookeeper says:
> >>> http://10.217.55.151:8883/solr/collection1/
> >>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
> >>>
> >>> INFO: Releasing directory:
> >>> /vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
> >>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
> >>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
> >>>   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
> >>>   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
> >>>
> >>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
> >>> recovering for 10.76.31.67:8883_solr but I still do not see the requested
> >>> state. I see state: active live:true
> >>>   at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
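(Aside on the shard1 vs shard11 leader question above: to double-check what zk itself has recorded for the shard1 leader against what the published cluster state says, I plan to do something like the following. Untested sketch, assuming the kazoo client and made-up zk hosts, and assuming I'm remembering the 4.x zk layout right, with leader props under /collections/<name>/leaders/<shard> and the full state in /clusterstate.json.)

from kazoo.client import KazooClient  # assumes kazoo is installed

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # made-up ensemble
zk.start()

# What zk's leader election currently has for shard1 of the collection...
leader, _ = zk.get("/collections/production_things/leaders/shard1")
print("zk leader props:", leader.decode("utf-8"))

# ...versus what the published cluster state claims for the same shard.
state, _ = zk.get("/clusterstate.json")
print("clusterstate.json:", state.decode("utf-8"))

zk.stop()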
> >>>
> >>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
> >>>> here.
> >>>>
> >>>> Can you do a stack dump on the overseer and see if you see an Overseer
> >>>> thread running perhaps? Or just post the results?
> >>>>
> >>>> To recover, you should be able to just restart the Overseer node and
> >>>> have someone else take over - they should pick up processing the queue.
> >>>>
> >>>> Any logs you might be able to share could be useful too.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>>
> >>>>> Also, looking at overseer_elect, everything looks fine. The node is
> >>>>> valid and live.
> >>>>>
> >>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>>>
> >>>>>> Sorry, should have specified. 4.1
> >>>>>>
> >>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>
> >>>>>>> What Solr version? 4.0, 4.1, 4.2?
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> My solr cloud has been running fine for weeks, but about a week ago it
> >>>>>>>> stopped dequeueing from the overseer queue, and now there are thousands
> >>>>>>>> of tasks on the queue, most of which look like
> >>>>>>>>
> >>>>>>>> {
> >>>>>>>>   "operation":"state",
> >>>>>>>>   "numShards":null,
> >>>>>>>>   "shard":"shard3",
> >>>>>>>>   "roles":null,
> >>>>>>>>   "state":"recovering",
> >>>>>>>>   "core":"production_things_shard3_2",
> >>>>>>>>   "collection":"production_things",
> >>>>>>>>   "node_name":"10.31.41.59:8883_solr",
> >>>>>>>>   "base_url":"http://10.31.41.59:8883/solr"}
> >>>>>>>>
> >>>>>>>> I'm trying to create a new collection through the collection API, and
> >>>>>>>> obviously nothing is happening...
> >>>>>>>>
> >>>>>>>> Any suggestion on how to fix this? Drop the queue in zk?
> >>>>>>>>
> >>>>>>>> How could it have gotten into this state in the first place?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Gary
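Following up on my own "drop the queue in zk?" question at the bottom: this is roughly how I'd inspect the stuck queue, and clear it only as a last resort. Untested sketch, assuming the kazoo client and made-up zk hosts; /overseer/queue is the path the 4.x overseer consumes from, if I'm remembering it right.

from kazoo.client import KazooClient  # assumes kazoo is installed

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # made-up ensemble
zk.start()

items = zk.get_children("/overseer/queue")
print(len(items), "entries sitting in the overseer queue")

# Peek at the oldest entry to confirm it is one of the stale
# "state":"recovering" messages before touching anything.
if items:
    oldest = sorted(items)[0]
    data, _ = zk.get("/overseer/queue/" + oldest)
    print(data.decode("utf-8"))

# Last resort only, with no overseer running: drop the stale entries so a
# freshly elected overseer starts from an empty queue.
# for item in items:
#     zk.delete("/overseer/queue/" + item)

zk.stop()

I'd much rather have a restarted overseer drain it properly, but at least this makes it easy to tell whether the queue is actually shrinking.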