I think those followers are red from trying to forward requests to the overseer while it was being restarted. I guess I'll see if they turn green over time, or I can restart them one at a time.
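If it helps, one way to see the state ZK actually has recorded for each replica (rather than eyeballing the cloud graph) is something like the untested sketch below. It assumes the Python kazoo client and the 4.x /clusterstate.json layout; the ZK host is a placeholder.

# Rough sketch only: read replica states straight from ZK instead of the cloud graph.
# Assumes the Python kazoo client and the 4.x /clusterstate.json layout; the ZK
# host below is a placeholder.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")  # placeholder
zk.start()

data, _stat = zk.get("/clusterstate.json")
clusterstate = json.loads(data.decode("utf-8"))

for collection, coll in clusterstate.items():
    for shard, shard_info in coll.get("shards", {}).items():
        for replica, info in shard_info.get("replicas", {}).items():
            print(collection, shard, replica, info.get("state"), info.get("base_url"))

zk.stop()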
On Fri, Mar 15, 2013 at 6:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:

> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
> are red now in the solr cloud graph... trying to figure out what that
> means...
>
>
> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>
>> I restarted the overseer node and another took over, queues are empty now.
>>
>> The server with core production_things_shard1_2 is having these errors:
>>
>> shard update error RetryNode:
>> http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException:
>> Server refused connection at:
>> http://10.104.59.189:8883/solr/production_things_shard11_replica1
>>
>> for shard11!!!
>>
>> I also got some strange errors on the restarted node. Makes me wonder if
>> there is a string-matching bug for shard1 vs shard11?
>>
>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
>>   at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
>>   at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
>>   at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
>>   at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
>>   at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.solr.common.SolrException: There is conflicting
>> information about the leader of shard: shard1 our state says:
>> http://10.104.59.189:8883/solr/collection1/ but zookeeper says:
>> http://10.217.55.151:8883/solr/collection1/
>>   at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
>>
>> INFO: Releasing
>> directory:/vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
>>   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
>>   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
>>
>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
>> state recovering for 10.76.31.67:8883_solr but I still do not see the
>> requested state. I see state: active live:true
>>   at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
>>
>>
>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
>>> here.
>>>
>>> Can you do a stack dump on the overseer and see if you see an Overseer
>>> thread running perhaps? Or just post the results?
>>>
>>> To recover, you should be able to just restart the Overseer node and
>>> have someone else take over - they should pick up processing the queue.
>>>
>>> Any logs you might be able to share could be useful too.
>>>
>>> - Mark
>>>
>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>
>>>> Also, looking at overseer_elect, everything looks fine. The node is
>>>> valid and live.
>>>>
>>>>
>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>
>>>>> Sorry, should have specified: 4.1.
>>>>>
>>>>>
>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>>> What Solr version? 4.0, 4.1, 4.2?
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>>
>>>>>>> My solr cloud has been running fine for weeks, but about a week ago
>>>>>>> it stopped dequeueing from the overseer queue, and now there are
>>>>>>> thousands of tasks on the queue, most of which look like:
>>>>>>>
>>>>>>> {
>>>>>>>   "operation":"state",
>>>>>>>   "numShards":null,
>>>>>>>   "shard":"shard3",
>>>>>>>   "roles":null,
>>>>>>>   "state":"recovering",
>>>>>>>   "core":"production_things_shard3_2",
>>>>>>>   "collection":"production_things",
>>>>>>>   "node_name":"10.31.41.59:8883_solr",
>>>>>>>   "base_url":"http://10.31.41.59:8883/solr"}
>>>>>>>
>>>>>>> I'm trying to create a new collection through the collection API,
>>>>>>> and obviously nothing is happening...
>>>>>>>
>>>>>>> Any suggestion on how to fix this? Drop the queue in zk?
>>>>>>>
>>>>>>> How could it have gotten into this state in the first place?
>>>>>>>
>>>>>>> thanks,
>>>>>>> gary
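Re: the "drop the queue in zk?" question further up the thread: restarting the overseer so another node picked up processing was enough here, but for anyone who does end up inspecting or clearing the queue by hand, the sketch below is roughly what that could look like. It is untested, assumes the queue lives at /overseer/queue, and uses the Python kazoo client; the ZK host is a placeholder.

# Rough sketch only: peek at (and optionally clear) the overseer queue in ZK.
# Assumes the queue lives at /overseer/queue and uses the Python kazoo client;
# the ZK host below is a placeholder. Restarting the overseer is the safer fix.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")  # placeholder
zk.start()

QUEUE = "/overseer/queue"
items = sorted(zk.get_children(QUEUE))
print("queued overseer tasks:", len(items))

# Print the first few tasks to confirm they are the stale "state" operations.
for name in items[:5]:
    data, _stat = zk.get(QUEUE + "/" + name)
    print(name, json.loads(data.decode("utf-8")))

# Last resort only -- drop every queued task:
# for name in items:
#     zk.delete(QUEUE + "/" + name)

zk.stop()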