Have a little more info about this ... the numDocs for *:* fluctuates
between two values (a difference of 324 docs) depending on which nodes
I hit (distrib=true):

  589,674,416
  589,674,092

Using distrib=false, I found 1 shard with a mismatch:

  shard15: { leader = 32,765,254, replica = 32,764,930, diff: 324 }

Interesting that the replica has more docs than the leader.

Unfortunately, due to some bad log management scripting on my part, the
logs were lost when these instances got re-started, which really bums
me out :-(

For now, I'm going to assume the replica with more docs is the one I
want to keep and will replicate the full index over to the other one.

Sorry about losing the logs :-(

Tim

On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabd...@gmail.com> wrote:
> Thanks for responding Mark. I'll collect the information you asked
> about and open a JIRA once I have a little more understanding of what
> happened. Hopefully I can piece together some story after going over
> the logs.
>
> As for replica / leader, I suspect some leaders went down, but
> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
> once and continued to serve queries, which is awesome.
>
> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
>> Yeah, that's no good.
>>
>> You might hit each node with distrib=false to get the doc counts.
>>
>> Which ones have what you think are the right counts and which the
>> wrong - e.g. is it all replicas that are off, or leaders as well?
>>
>> You say several replicas - do you mean no leaders went down?
>>
>> You might look closer at the logs for a node that has its count off.
>>
>> Finally, I guess I'd try and track it in a JIRA issue.
>>
>> - Mark
>>
>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>>> We had a rogue query take out several replicas in a large 4.2.0
>>> cluster today, due to OOMs (we use the JVM args to kill the process
>>> on OOM).
>>>
>>> After recovering, when I execute the match-all-docs query (*:*), I
>>> get a different count each time.
>>>
>>> In other words, if I execute q=*:* several times in a row, I get a
>>> different count back for numDocs.
>>>
>>> This was not the case prior to the failure, as that is one thing we
>>> monitor for.
>>>
>>> I think I should be worried ... any ideas on how to troubleshoot
>>> this? One thing to mention is that several of my replicas had to do
>>> full recoveries from the leader when they came back online. Indexing
>>> was happening when the replicas failed.
>>>
>>> Thanks.
>>> Tim
>>
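[Editor's note: the per-core check described above (hit each core with distrib=false and compare numDocs) can be scripted. Below is a rough sketch, not from the original thread; the base URLs, core names, and helper function names are hypothetical. It assumes Solr's standard /select handler with wt=json, where the local doc count comes back as response.numFound.]

```python
# Sketch of the per-core doc-count check: query every replica of every
# shard with distrib=false and flag shards whose replicas disagree.
import json
from urllib.request import urlopen

def num_docs(core_url):
    """Ask a single core for its LOCAL doc count (distrib=false keeps
    the query from fanning out to the rest of the collection)."""
    url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
    with urlopen(url) as resp:
        return json.load(resp)["response"]["numFound"]

def find_mismatches(counts):
    """counts: {shard: {replica_name: numDocs}}. Returns a dict of
    shards whose replicas disagree, mapped to the max difference."""
    bad = {}
    for shard, replicas in counts.items():
        vals = list(replicas.values())
        if max(vals) != min(vals):
            bad[shard] = max(vals) - min(vals)
    return bad

# Example using the numbers from this thread (counts would normally be
# built by calling num_docs() against each core's hypothetical URL,
# e.g. "http://host1:8983/solr/collection1_shard15_replica1"):
counts = {
    "shard15": {"leader": 32_765_254, "replica": 32_764_930},
    "shard16": {"leader": 1_000, "replica": 1_000},
}
print(find_mismatches(counts))  # {'shard15': 324}
```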