Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Mark Miller Sat, 20 Apr 2013 09:12:22 -0700

Yeah, that's no good.

You might hit each node with distrib=false to get the doc counts.
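
A rough sketch of that per-node check, in case it helps. It assumes each
core answers at a URL like http://host:8983/solr/<core> and that the select
handler returns JSON with response.numFound; the helper names and the
shard/core layout passed to mismatched_shards are made up for illustration,
not anything Solr ships.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query that is answered by this core only."""
    # distrib=false keeps the request from fanning out to other shards;
    # rows=0 skips fetching documents, since only the count is needed.
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def parse_num_found(body):
    """Pull numFound out of a Solr JSON select response."""
    return json.loads(body)["response"]["numFound"]


def fetch_count(core_url, timeout=10):
    """Query one core directly and return its local document count."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return parse_num_found(resp.read().decode("utf-8"))


def mismatched_shards(counts_by_shard):
    """Return the shards whose cores disagree on the document count.

    counts_by_shard maps shard name -> {core_url: count} (hypothetical
    layout; fill it in from your own cluster state).
    """
    return {
        shard: counts
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    }
```

Call fetch_count once per core (leader and replicas), group the results by
shard, and pass them to mismatched_shards. If indexing is still running, the
counts can probably differ briefly between commits, so it is worth repeating
the check after a hard commit before trusting a mismatch.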

Which nodes have what you think are the right counts, and which have the wrong
ones? E.g., is it only the replicas that are off, or the leaders as well?

You say several replicas - do you mean no leaders went down?

You might look closer at the logs for a node that has its count off.

Finally, I guess I'd try and track it in a JIRA issue.

- Mark

On Apr 19, 2013, at 6:37 PM, Timothy Potter <[email protected]> wrote:

> We had a rogue query take out several replicas in a large 4.2.0 cluster
> today, due to OOMs (we use JVM args to kill the process on OOM).
> 
> After recovering, when I execute the match all docs query (*:*), I get a
> different count each time.
> 
> In other words, if I execute q=*:* several times in a row, then I get a
> different count back for numDocs.
> 
> This was not the case prior to the failure as that is one thing we monitor
> for.
> 
> I think I should be worried ... any ideas on how to troubleshoot this? One
> thing to mention is that several of my replicas had to do full recoveries
> from the leader when they came back online. Indexing was happening when the
> replicas failed.
> 
> Thanks.
> Tim
