Thanks for responding, Mark. I'll collect the information you asked
about and open a JIRA once I have a little more understanding of what
happened. Hopefully I can piece together some story after going over
the logs.

As for replicas vs. leaders, I suspect some leaders went down, but
failover to new leaders seemed to work fine. We lost about 9 nodes at
once and continued to serve queries, which is awesome.
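
For reference, the per-node check suggested below (querying each core
with distrib=false) could be scripted roughly like this. It is only a
sketch: the helper names, host names, and shard layout are hypothetical,
and it assumes each core answers /select with wt=json and reports its
count in response.numFound.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query that is answered by this core only."""
    # distrib=false keeps the core from fanning the query out to other shards.
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_count(core_url, timeout=10):
    """Return the numFound reported by a single core (makes a network call)."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def shards_with_mismatch(counts):
    """counts maps shard name -> {core_url: doc count}.

    Return only the shards whose cores disagree with each other.
    """
    return {
        shard: per_core
        for shard, per_core in counts.items()
        if len(set(per_core.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical layout: replace with the real core URLs for each shard.
    layout = {
        "shard1": ["http://host1:8983/solr/collection1",
                   "http://host2:8983/solr/collection1"],
    }
    counts = {
        shard: {url: fetch_count(url) for url in urls}
        for shard, urls in layout.items()
    }
    print(shards_with_mismatch(counts))
```

Any shard that shows up in the output has cores that disagree, which
would narrow down which logs to read first. Counts can also differ
briefly while indexing is in progress, so it is best run while indexing
is paused or after a hard commit.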

On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
> Yeah, that's no good.
>
> You might hit each node with distrib=false to get the doc counts.
>
> Which ones have what you think are the right counts and which the wrong, e.g.
> is it all replicas that are off, or leaders as well?
>
> You say several replicas - do you mean no leaders went down?
>
> You might look closer at the logs for a node that has its count off.
>
> Finally, I guess I'd try and track it in a JIRA issue.
>
> - Mark
>
> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>
>> After recovering, when I execute the match all docs query (*:*), I get a
>> different count each time.
>>
>> In other words, if I execute q=*:* several times in a row, then I get a
>> different count back for numDocs.
>>
>> This was not the case prior to the failure as that is one thing we monitor
>> for.
>>
>> I think I should be worried ... any ideas on how to troubleshoot this? One
>> thing to mention is that several of my replicas had to do full recoveries
>> from the leader when they came back online. Indexing was happening when the
>> replicas failed.
>>
>> Thanks.
>> Tim
>
