nm - can't read my own output - the leader had more docs than the replica ;-)
On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <thelabd...@gmail.com> wrote:
> Have a little more info about this ... the numDocs for *:* fluctuates
> between two values (difference of 324 docs) depending on which nodes I
> hit (distrib=true):
>
> 589,674,416
> 589,674,092
>
> Using distrib=false, I found 1 shard with a mismatch:
>
> shard15: {
>   leader  = 32,765,254
>   replica = 32,764,930   diff: 324
> }
>
> Interesting that the replica has more docs than the leader.
>
> Unfortunately, due to some bad log management scripting on my part,
> the logs were lost when these instances got restarted, which really
> bums me out :-(
>
> For now, I'm going to assume the replica with more docs is the one I
> want to keep and will replicate the full index over to the other one.
> Sorry about losing the logs :-(
>
> Tim
>
>
> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>> Thanks for responding Mark. I'll collect the information you asked
>> about and open a JIRA once I have a little more understanding of what
>> happened. Hopefully I can piece together some story after going over
>> the logs.
>>
>> As for replica / leader, I suspect some leaders went down, but
>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>> once and continued to serve queries, which is awesome.
>>
>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>> Yeah, that's no good.
>>>
>>> You might hit each node with distrib=false to get the doc counts.
>>>
>>> Which ones have what you think are the right counts and which the wrong -
>>> e.g. is it all replicas that are off, or leaders as well?
>>>
>>> You say several replicas - do you mean no leaders went down?
>>>
>>> You might look closer at the logs for a node that has its count off.
>>>
>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>
>>> - Mark
>>>
>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>
>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>>>
>>>> After recovering, when I execute the match-all-docs query (*:*), I get a
>>>> different count each time.
>>>>
>>>> In other words, if I execute q=*:* several times in a row, I get a
>>>> different count back for numDocs.
>>>>
>>>> This was not the case prior to the failure, as that is one thing we
>>>> monitor for.
>>>>
>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>> thing to mention is that several of my replicas had to do full recoveries
>>>> from the leader when they came back online. Indexing was happening when the
>>>> replicas failed.
>>>>
>>>> Thanks.
>>>> Tim
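
For anyone who wants to script the distrib=false check Mark suggests, here is a
minimal sketch in Python. The host, port, and core names below are placeholders,
not the actual layout of this cluster; substitute the replica cores of whichever
shard you suspect.

import json
import urllib.parse
import urllib.request

# Hypothetical core URLs for the two replicas of a suspect shard; replace
# with the real host:port and core names from your clusterstate.
CORES = [
    "http://solr-node1:8983/solr/collection1_shard15_replica1",
    "http://solr-node2:8983/solr/collection1_shard15_replica2",
]

# distrib=false keeps the query on the local core, so each replica reports
# its own count rather than a distributed one.
PARAMS = urllib.parse.urlencode(
    {"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"}
)

def num_docs(core_url):
    with urllib.request.urlopen(core_url + "/select?" + PARAMS) as resp:
        return json.load(resp)["response"]["numFound"]

counts = {core: num_docs(core) for core in CORES}
for core, count in counts.items():
    print(core, count)
print("max - min =", max(counts.values()) - min(counts.values()))

Comparing both replicas of a suspect shard this way makes a gap like the
324-doc difference above easy to spot.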