Have a little more info about this ... the numDocs for *:* fluctuates
between two values (a difference of 324 docs) depending on which nodes
I hit (distrib=true):

  589,674,416
  589,674,092

Using distrib=false, I found 1 shard with a mismatch:

  shard15: { leader = 32,765,254, replica = 32,764,930, diff: 324 }

Interesting that the replica has more docs than the leader.

Unfortunately, due to some bad log management scripting on my part, the
logs were lost when these instances got re-started, which really bums
me out :-(

For now, I'm going to assume the replica with more docs is the one I
want to keep and will replicate the full index over to the other one.

Sorry about losing the logs :-(

Tim

On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabd...@gmail.com> wrote:
> Thanks for responding Mark. I'll collect the information you asked
> about and open a JIRA once I have a little more understanding of what
> happened. Hopefully I can piece together some story after going over
> the logs.
>
> As for replica / leader, I suspect some leaders went down, but
> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
> once and continued to serve queries, which is awesome.
>
> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
>> Yeah, that's no good.
>>
>> You might hit each node with distrib=false to get the doc counts.
>>
>> Which ones have what you think are the right counts and which the
>> wrong - e.g. is it all replicas that are off, or leaders as well?
>>
>> You say several replicas - do you mean no leaders went down?
>>
>> You might look closer at the logs for a node that has its count off.
>>
>> Finally, I guess I'd try and track it in a JIRA issue.
>>
>> - Mark
>>
>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>>> We had a rogue query take out several replicas in a large 4.2.0
>>> cluster today, due to OOMs (we use the JVM args to kill the process
>>> on OOM).
>>>
>>> After recovering, when I execute the match-all-docs query (*:*), I
>>> get a different count each time.
>>>
>>> In other words, if I execute q=*:* several times in a row, I get a
>>> different count back for numDocs.
>>>
>>> This was not the case prior to the failure, as that is one thing we
>>> monitor for.
>>>
>>> I think I should be worried ... any ideas on how to troubleshoot
>>> this? One thing to mention is that several of my replicas had to do
>>> full recoveries from the leader when they came back online. Indexing
>>> was happening when the replicas failed.
>>>
>>> Thanks.
>>> Tim
>>
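[Editor's note: the per-core check described above (hit each core with distrib=false and compare numDocs) can be scripted. Below is a rough sketch, not from the original thread; the base URLs, core names, and helper function names are hypothetical. It assumes Solr's standard /select handler with wt=json, where the local doc count comes back as response.numFound.]

```python
# Sketch of the per-core doc-count check: query every replica of every
# shard with distrib=false and flag shards whose replicas disagree.
import json
from urllib.request import urlopen

def num_docs(core_url):
    """Ask a single core for its LOCAL doc count (distrib=false keeps
    the query from fanning out to the rest of the collection)."""
    url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
    with urlopen(url) as resp:
        return json.load(resp)["response"]["numFound"]

def find_mismatches(counts):
    """counts: {shard: {replica_name: numDocs}}. Returns a dict of
    shards whose replicas disagree, mapped to the max difference."""
    bad = {}
    for shard, replicas in counts.items():
        vals = list(replicas.values())
        if max(vals) != min(vals):
            bad[shard] = max(vals) - min(vals)
    return bad

# Example using the numbers from this thread (counts would normally be
# built by calling num_docs() against each core's hypothetical URL,
# e.g. "http://host1:8983/solr/collection1_shard15_replica1"):
counts = {
    "shard15": {"leader": 32_765_254, "replica": 32_764_930},
    "shard16": {"leader": 1_000, "replica": 1_000},
}
print(find_mismatches(counts))  # {'shard15': 324}
```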