I will upgrade to 4.2 this weekend and see what happens. We are on EC2 and have had a few issues with hostnames with both zk and solr (though in this case I haven't rebooted any instances either).

The upgrade is a relative pain because we have a query/scorer fork of lucene along with supplemental jars, and zk cannot distribute binary jars via the config. We are also multi-collection per zk... I wish the core admin didn't require a core to be defined up front; I would love to have an instance with no cores and then just create the cores I need.
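Something like this is what I have in mind (an untested sketch against the CoreAdmin HTTP API as I remember it in 4.x; the host, core name, and config name below are all made up):

import requests  # assumes the requests library is available

SOLR = "http://10.0.0.1:8883/solr"  # hypothetical node started with no cores

# Ask the node to create a core for the collection, with the config pulled
# from zk rather than a locally provisioned directory.
params = {
    "action": "CREATE",
    "name": "production_things_shard1_replica2",        # made-up core name
    "collection": "production_things",
    "shard": "shard1",
    "collection.configName": "production_things_conf",  # config already uploaded to zk
}
resp = requests.get(SOLR + "/admin/cores", params=params)
print(resp.status_code, resp.text)

That's really all I want an empty node to be able to do out of the box.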
-g

On Fri, Mar 15, 2013 at 7:14 PM, Mark Miller <markrmil...@gmail.com> wrote:

> On Mar 15, 2013, at 10:04 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>
> > i think those followers are red from trying to forward requests to the
> > overseer while it was being restarted. i guess i'll see if they become
> > green over time. or i guess i can restart them one at a time..
>
> Restarting the cluster clears things up. It shouldn't take too long for
> those nodes to recover though - they should have been up to date before.
> The couple of exceptions you posted def indicate something is out of whack.
> It's something I'd like to get to the bottom of.
>
> - Mark
>
> > On Fri, Mar 15, 2013 at 6:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >
> >> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
> >> are red now in the solr cloud graph.. trying to figure out what that
> >> means...
> >>
> >> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>
> >>> I restarted the overseer node and another took over; queues are empty now.
> >>>
> >>> The server with core production_things_shard1_2 is having these errors:
> >>>
> >>> shard update error RetryNode:
> >>> http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException:
> >>> Server refused connection at:
> >>> http://10.104.59.189:8883/solr/production_things_shard11_replica1
> >>>
> >>> for shard11!!!
> >>>
> >>> I also got some strange errors on the restarted node. Makes me wonder if
> >>> there is a string-matching bug for shard1 vs shard11?
> >>>
> >>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
> >>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
> >>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
> >>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
> >>>   at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
> >>>   at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
> >>>   at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
> >>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
> >>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
> >>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>   at java.lang.Thread.run(Thread.java:722)
> >>> Caused by: org.apache.solr.common.SolrException: There is conflicting information
> >>> about the leader of shard: shard1 our state says:
> >>> http://10.104.59.189:8883/solr/collection1/ but zookeeper says:
> >>> http://10.217.55.151:8883/solr/collection1/
> >>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
> >>>
> >>> INFO: Releasing directory:
> >>> /vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
> >>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
> >>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
> >>>   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
> >>>   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
> >>>
> >>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
> >>> recovering for 10.76.31.67:8883_solr but I still do not see the requested
> >>> state. I see state: active live:true
> >>>   at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
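(Aside on the shard1 vs shard11 leader question above: to double-check what zk itself has recorded for the shard1 leader against what the published cluster state says, I plan to do something like the following. Untested sketch, assuming the kazoo client and made-up zk hosts, and assuming I'm remembering the 4.x zk layout right, with leader props under /collections/<name>/leaders/<shard> and the full state in /clusterstate.json.)

from kazoo.client import KazooClient  # assumes kazoo is installed

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # made-up ensemble
zk.start()

# What zk's leader election currently has for shard1 of the collection...
leader, _ = zk.get("/collections/production_things/leaders/shard1")
print("zk leader props:", leader.decode("utf-8"))

# ...versus what the published cluster state claims for the same shard.
state, _ = zk.get("/clusterstate.json")
print("clusterstate.json:", state.decode("utf-8"))

zk.stop()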
> >>>
> >>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
> >>>> here.
> >>>>
> >>>> Can you do a stack dump on the overseer and see if you see an Overseer
> >>>> thread running perhaps? Or just post the results?
> >>>>
> >>>> To recover, you should be able to just restart the Overseer node and
> >>>> have someone else take over - they should pick up processing the queue.
> >>>>
> >>>> Any logs you might be able to share could be useful too.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>>
> >>>>> Also, looking at overseer_elect, everything looks fine. The node is
> >>>>> valid and live.
> >>>>>
> >>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>>>
> >>>>>> Sorry, should have specified. 4.1
> >>>>>>
> >>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>
> >>>>>>> What Solr version? 4.0, 4.1, 4.2?
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> My solr cloud has been running fine for weeks, but about a week ago it
> >>>>>>>> stopped dequeueing from the overseer queue, and now there are thousands
> >>>>>>>> of tasks on the queue, most of which look like
> >>>>>>>>
> >>>>>>>> {
> >>>>>>>>   "operation":"state",
> >>>>>>>>   "numShards":null,
> >>>>>>>>   "shard":"shard3",
> >>>>>>>>   "roles":null,
> >>>>>>>>   "state":"recovering",
> >>>>>>>>   "core":"production_things_shard3_2",
> >>>>>>>>   "collection":"production_things",
> >>>>>>>>   "node_name":"10.31.41.59:8883_solr",
> >>>>>>>>   "base_url":"http://10.31.41.59:8883/solr"}
> >>>>>>>>
> >>>>>>>> I'm trying to create a new collection through the collection API, and
> >>>>>>>> obviously nothing is happening...
> >>>>>>>>
> >>>>>>>> Any suggestion on how to fix this? Drop the queue in zk?
> >>>>>>>>
> >>>>>>>> How could it have gotten into this state in the first place?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Gary
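Following up on my own "drop the queue in zk?" question at the bottom: this is roughly how I'd inspect the stuck queue, and clear it only as a last resort. Untested sketch, assuming the kazoo client and made-up zk hosts; /overseer/queue is the path the 4.x overseer consumes from, if I'm remembering it right.

from kazoo.client import KazooClient  # assumes kazoo is installed

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # made-up ensemble
zk.start()

items = zk.get_children("/overseer/queue")
print(len(items), "entries sitting in the overseer queue")

# Peek at the oldest entry to confirm it is one of the stale
# "state":"recovering" messages before touching anything.
if items:
    oldest = sorted(items)[0]
    data, _ = zk.get("/overseer/queue/" + oldest)
    print(data.decode("utf-8"))

# Last resort only, with no overseer running: drop the stale entries so a
# freshly elected overseer starts from an empty queue.
# for item in items:
#     zk.delete("/overseer/queue/" + item)

zk.stop()

I'd much rather have a restarted overseer drain it properly, but at least this makes it easy to tell whether the queue is actually shrinking.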