Thanks for responding, Mark. I'll collect the information you asked about and open a JIRA once I understand a little more about what happened. Hopefully I can piece the story together after going over the logs.
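For the per-core doc counts, I'm thinking of something along these lines. It is only a rough sketch: the helper names are my own, and the core URLs passed in would be placeholders for our real layout. It sends the match-all query to each core with distrib=false and rows=0, reads numFound from the JSON response, and flags shards whose cores disagree.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query that is answered by this one core only."""
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def parse_num_found(body):
    """Pull numFound out of a Solr JSON select response body."""
    return json.loads(body)["response"]["numFound"]


def fetch_count(core_url, timeout=10):
    """Query one core directly and return its local doc count."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return parse_num_found(resp.read().decode("utf-8"))


def mismatched_shards(counts_by_shard):
    """Return only the shards whose cores report different counts.

    counts_by_shard maps a shard name to {core_url: doc_count}.
    """
    return {
        shard: counts
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    }


def collect_counts(cores_by_shard):
    """cores_by_shard maps a shard name to a list of core URLs (assumed layout)."""
    return {
        shard: {url: fetch_count(url) for url in urls}
        for shard, urls in cores_by_shard.items()
    }
```

Running mismatched_shards(collect_counts(...)) over the leader and replica URLs for each shard should show which cores are off. Counts can differ briefly while indexing and commits are still in flight, so I'd run it a few times with indexing paused before trusting the result.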
As for replica / leader, I suspect some leaders went down but fail-over to new leaders seemed to work fine. We lost about 9 nodes at once and continued to serve queries, which is awesome.

On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
> Yeah, that's no good.
>
> You might hit each node with distrib=false to get the doc counts.
>
> Which ones have what you think are the right counts and which the wrong - eg
> is it all replicas that are off, or leaders as well?
>
> You say several replicas - do you mean no leaders went down?
>
> You might look closer at the logs for a node that has its count off.
>
> Finally, I guess I'd try and track it in a JIRA issue.
>
> - Mark
>
> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>
>> After recovering, when I execute the match all docs query (*:*), I get a
>> different count each time.
>>
>> In other words, if I execute q=*:* several times in a row, then I get a
>> different count back for numDocs.
>>
>> This was not the case prior to the failure as that is one thing we monitor
>> for.
>>
>> I think I should be worried ... any ideas on how to troubleshoot this? One
>> thing to mention is that several of my replicas had to do full recoveries
>> from the leader when they came back online. Indexing was happening when the
>> replicas failed.
>>
>> Thanks.
>> Tim