What do you know about the # of docs you *should* have? Do you have that # when taking the bad replica out of the equation?
- Mark

On Apr 22, 2013, at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Bummer on the log loss :(
>
> Good info though. Somehow that replica became active without actually
> syncing? This is heavily tested (though not with OOM's I suppose), so I'm a
> little surprised, but it's hard to speculate how it happened without the
> logs. Especially the logs from the node that is off would be great - we would
> see what it did when it recovered and why it might think it was in sync :(
>
> - Mark
>
> On Apr 22, 2013, at 2:19 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> nm - can't read my own output - the leader had more docs than the replica ;-)
>>
>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>> Have a little more info about this ... the numDocs for *:* fluctuates
>>> between two values (a difference of 324 docs) depending on which nodes I
>>> hit (distrib=true):
>>>
>>> 589,674,416
>>> 589,674,092
>>>
>>> Using distrib=false, I found 1 shard with a mismatch:
>>>
>>> shard15: {
>>>   leader  = 32,765,254
>>>   replica = 32,764,930   diff: 324
>>> }
>>>
>>> Interesting that the replica has more docs than the leader.
>>>
>>> Unfortunately, due to some bad log management scripting on my part,
>>> the logs were lost when these instances got re-started, which really
>>> bums me out :-(
>>>
>>> For now, I'm going to assume the replica with more docs is the one I
>>> want to keep and will replicate the full index over to the other one.
>>> Sorry about losing the logs :-(
>>>
>>> Tim
>>>
>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>> Thanks for responding, Mark. I'll collect the information you asked
>>>> about and open a JIRA once I have a little more understanding of what
>>>> happened. Hopefully I can piece together some story after going over
>>>> the logs.
>>>>
>>>> As for replica / leader, I suspect some leaders went down, but
>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>>> once and continued to serve queries, which is awesome.
>>>>
>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> Yeah, that's no good.
>>>>>
>>>>> You might hit each node with distrib=false to get the doc counts.
>>>>>
>>>>> Which ones have what you think are the right counts and which the wrong -
>>>>> e.g. is it all replicas that are off, or leaders as well?
>>>>>
>>>>> You say several replicas - do you mean no leaders went down?
>>>>>
>>>>> You might look closer at the logs for a node that has its count off.
>>>>>
>>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>>>
>>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>>>>>
>>>>>> After recovering, when I execute the match-all-docs query (*:*), I get a
>>>>>> different count each time.
>>>>>>
>>>>>> In other words, if I execute q=*:* several times in a row, I get a
>>>>>> different count back for numDocs.
>>>>>>
>>>>>> This was not the case prior to the failure, as that is one thing we
>>>>>> monitor for.
>>>>>>
>>>>>> I think I should be worried ... any ideas on how to troubleshoot this?
>>>>>> One thing to mention is that several of my replicas had to do full
>>>>>> recoveries from the leader when they came back online. Indexing was
>>>>>> happening when the replicas failed.
>>>>>>
>>>>>> Thanks.
>>>>>> Tim
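
For reference, a minimal sketch of the distrib=false per-core check discussed above. The host, port, and core names are hypothetical placeholders, not taken from the thread; the only Solr specifics assumed are the standard /select handler and its q, rows, distrib, and wt parameters.

    # Sketch: compare per-core doc counts using distrib=false so each core
    # reports only its own local index, bypassing distributed search.
    # Host/port/core URLs below are hypothetical; adjust to your cluster layout.
    import json
    import urllib.request

    # Cores to compare for each shard (e.g. leader first, then replicas).
    SHARD_CORES = {
        "shard15": [
            "http://solr-host-a:8983/solr/collection1_shard15_replica1",  # assumed leader
            "http://solr-host-b:8983/solr/collection1_shard15_replica2",
        ],
    }

    def num_docs(core_url):
        """Ask a single core for its local doc count."""
        url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["response"]["numFound"]

    for shard, cores in SHARD_CORES.items():
        counts = [(core, num_docs(core)) for core in cores]
        values = {n for _, n in counts}
        if len(values) == 1:
            status = "OK"
        else:
            status = "MISMATCH (diff: %d)" % (max(values) - min(values))
        print(shard, status)
        for core_url, n in counts:
            print("  %s -> %d" % (core_url, n))

Running this against every shard of the collection would surface a per-shard leader/replica discrepancy like the 324-doc difference reported on shard15.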