What do you know about the # of docs you *should* have? Do you have that # when taking the bad replica out of the equation?
- Mark

On Apr 22, 2013, at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Bummer on the log loss :(
>
> Good info though. Somehow that replica became active without actually
> syncing? This is heavily tested (though not with OOM's I suppose), so I'm a
> little surprised, but it's hard to speculate how it happened without the
> logs. Especially the logs from the node that is off would be great - we would
> see what it did when it recovered and why it might think it was in sync :(
>
> - Mark
>
> On Apr 22, 2013, at 2:19 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> nm - can't read my own output - the leader had more docs than the replica ;-)
>>
>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>> Have a little more info about this ... the numDocs for *:* fluctuates
>>> between two values (a difference of 324 docs) depending on which nodes I
>>> hit (distrib=true):
>>>
>>> 589,674,416
>>> 589,674,092
>>>
>>> Using distrib=false, I found 1 shard with a mismatch:
>>>
>>> shard15: {
>>>   leader  = 32,765,254
>>>   replica = 32,764,930   diff: 324
>>> }
>>>
>>> Interesting that the replica has more docs than the leader.
>>>
>>> Unfortunately, due to some bad log management scripting on my part,
>>> the logs were lost when these instances got re-started, which really
>>> bums me out :-(
>>>
>>> For now, I'm going to assume the replica with more docs is the one I
>>> want to keep and will replicate the full index over to the other one.
>>> Sorry about losing the logs :-(
>>>
>>> Tim
>>>
>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>> Thanks for responding, Mark. I'll collect the information you asked
>>>> about and open a JIRA once I have a little more understanding of what
>>>> happened. Hopefully I can piece together some story after going over
>>>> the logs.
>>>>
>>>> As for replica / leader, I suspect some leaders went down, but
>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>>> once and continued to serve queries, which is awesome.
>>>>
>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> Yeah, that's no good.
>>>>>
>>>>> You might hit each node with distrib=false to get the doc counts.
>>>>>
>>>>> Which ones have what you think are the right counts and which the wrong -
>>>>> e.g. is it all replicas that are off, or leaders as well?
>>>>>
>>>>> You say several replicas - do you mean no leaders went down?
>>>>>
>>>>> You might look closer at the logs for a node that has its count off.
>>>>>
>>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>>>>
>>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>>>>>
>>>>>> After recovering, when I execute the match-all-docs query (*:*), I get a
>>>>>> different count each time.
>>>>>>
>>>>>> In other words, if I execute q=*:* several times in a row, I get a
>>>>>> different count back for numDocs.
>>>>>>
>>>>>> This was not the case prior to the failure, as that is one thing we
>>>>>> monitor for.
>>>>>>
>>>>>> I think I should be worried ... any ideas on how to troubleshoot this?
>>>>>> One thing to mention is that several of my replicas had to do full
>>>>>> recoveries from the leader when they came back online. Indexing was
>>>>>> happening when the replicas failed.
>>>>>>
>>>>>> Thanks.
>>>>>> Tim
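
For reference, a minimal sketch of the distrib=false per-core check discussed above. The host, port, and core names are hypothetical placeholders, not taken from the thread; the only Solr specifics assumed are the standard /select handler and its q, rows, distrib, and wt parameters.

    # Sketch: compare per-core doc counts using distrib=false so each core
    # reports only its own local index, bypassing distributed search.
    # Host/port/core URLs below are hypothetical; adjust to your cluster layout.
    import json
    import urllib.request

    # Cores to compare for each shard (e.g. leader first, then replicas).
    SHARD_CORES = {
        "shard15": [
            "http://solr-host-a:8983/solr/collection1_shard15_replica1",  # assumed leader
            "http://solr-host-b:8983/solr/collection1_shard15_replica2",
        ],
    }

    def num_docs(core_url):
        """Ask a single core for its local doc count."""
        url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["response"]["numFound"]

    for shard, cores in SHARD_CORES.items():
        counts = [(core, num_docs(core)) for core in cores]
        values = {n for _, n in counts}
        if len(values) == 1:
            status = "OK"
        else:
            status = "MISMATCH (diff: %d)" % (max(values) - min(values))
        print(shard, status)
        for core_url, n in counts:
            print("  %s -> %d" % (core_url, n))

Running this against every shard of the collection would surface a per-shard leader/replica discrepancy like the 324-doc difference reported on shard15.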