nm - can't read my own output - the leader had more docs than the replica ;-)
On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <thelabd...@gmail.com> wrote:
> Have a little more info about this ... the numDocs for *:* fluctuates
> between two values (difference of 324 docs) depending on which nodes I
> hit (distrib=true):
>
> 589,674,416
> 589,674,092
>
> Using distrib=false, I found 1 shard with a mismatch:
>
> shard15: {
>   leader  = 32,765,254
>   replica = 32,764,930   diff: 324
> }
>
> Interesting that the replica has more docs than the leader.
>
> Unfortunately, due to some bad log management scripting on my part,
> the logs were lost when these instances got restarted, which really
> bums me out :-(
>
> For now, I'm going to assume the replica with more docs is the one I
> want to keep and will replicate the full index over to the other one.
> Sorry about losing the logs :-(
>
> Tim
>
>
> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>> Thanks for responding Mark. I'll collect the information you asked
>> about and open a JIRA once I have a little more understanding of what
>> happened. Hopefully I can piece together some story after going over
>> the logs.
>>
>> As for replica / leader, I suspect some leaders went down, but
>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>> once and continued to serve queries, which is awesome.
>>
>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>> Yeah, that's no good.
>>>
>>> You might hit each node with distrib=false to get the doc counts.
>>>
>>> Which ones have what you think are the right counts and which the wrong -
>>> e.g. is it all replicas that are off, or leaders as well?
>>>
>>> You say several replicas - do you mean no leaders went down?
>>>
>>> You might look closer at the logs for a node that has its count off.
>>>
>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>
>>> - Mark
>>>
>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>
>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>>>
>>>> After recovering, when I execute the match-all-docs query (*:*), I get a
>>>> different count each time.
>>>>
>>>> In other words, if I execute q=*:* several times in a row, I get a
>>>> different count back for numDocs.
>>>>
>>>> This was not the case prior to the failure, as that is one thing we
>>>> monitor for.
>>>>
>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>> thing to mention is that several of my replicas had to do full recoveries
>>>> from the leader when they came back online. Indexing was happening when the
>>>> replicas failed.
>>>>
>>>> Thanks.
>>>> Tim
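
For anyone who wants to script the distrib=false check Mark suggests, here is a
minimal sketch in Python. The host, port, and core names below are placeholders,
not the actual layout of this cluster; substitute the replica cores of whichever
shard you suspect.

import json
import urllib.parse
import urllib.request

# Hypothetical core URLs for the two replicas of a suspect shard; replace
# with the real host:port and core names from your clusterstate.
CORES = [
    "http://solr-node1:8983/solr/collection1_shard15_replica1",
    "http://solr-node2:8983/solr/collection1_shard15_replica2",
]

# distrib=false keeps the query on the local core, so each replica reports
# its own count rather than a distributed one.
PARAMS = urllib.parse.urlencode(
    {"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"}
)

def num_docs(core_url):
    with urllib.request.urlopen(core_url + "/select?" + PARAMS) as resp:
        return json.load(resp)["response"]["numFound"]

counts = {core: num_docs(core) for core in CORES}
for core, count in counts.items():
    print(core, count)
print("max - min =", max(counts.values()) - min(counts.values()))

Comparing both replicas of a suspect shard this way makes a gap like the
324-doc difference above easy to spot.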