On 07/09/2018 05:53 PM, Digimer wrote:
> On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
>> On 07/09/2018 05:33 PM, Digimer wrote:
>>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>>> On 07/09/2018 03:49 PM, Digimer wrote:
>>>>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
>>>>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Any ideas what triggers fencing script or stonith?
>>>>>>>
>>>>>>> Given the setup below:
>>>>>>> 1. I have two nodes
>>>>>>> 2. Configured fencing on both nodes
>>>>>>> 3. Configured delay=15 and delay=30 on fence1(for Node1) and
>>>>>>> fence2(for Node2) respectively
>>>>>>>
>>>>>>> *What does it mean to configured delay in stonith? wait for 15 seconds
>>>>>>> before it fence the node?
>>>>>> Given that on a 2-node-cluster you don't have real quorum to make one
>>>>>> partial cluster fence the rest of the nodes the different delays are
>>>>>> meant to prevent a fencing-race.
>>>>>> Without different delays that would lead to both nodes fencing each
>>>>>> other at the same time - finally both being down.
>>>>> Not true, the faster node will kill the slower node first. It is
>>>>> possible that through misconfiguration, both could die, but it's rare
>>>>> and easily avoided with a 'delay="15"' set on the fence config for the
>>>>> node you want to win.
>>>> What exactly is not true? Aren't we saying the same?
>>>> Of course one of the delays can be 0 (most important is that
>>>> they are different).
>>> Perhaps I misunderstood your message. It seemed to me that the
>>> implication was that fencing in 2-node without a delay always ends up
>>> with both nodes being down, which isn't the case. It can happen if the
>>> fence methods are not setup right (ie: the node isn't set to immediately
>>> power off on ACPI power button event).
>> Yes, a misunderstanding I guess.
>>
>> Should have been more verbose in saying that due to the
>> time between the fencing-command fired off to the fencing
>> device and the actual fencing taking place (as you state
>> dependent on how it is configured in detail - but a measurable
>> time in all cases) there is a certain probability that when
>> both nodes start fencing at roughly the same time we will
>> end up with 2 nodes down.
>>
>> Everybody has to find his own tradeoff between reliability
>> fence-races are prevented and fencing delay I guess.
> We've used this;
>
> 1. IPMI (with the guest OS set to immediately power off) as primary,
> with a 15 second delay on the active node.
>
> 2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
> for when IPMI fails, with no delay.
>
> In ~8 years, across dozens and dozens of clusters and countless fence
> actions, we've never had a dual-fence event (where both nodes go down).
> So it can be done safely, but as always, test test test before prod.

No doubt that this setup works reliably.
You just have to know your fencing devices and the delays
they involve.

If we are talking about SBD (with a disk, as otherwise it doesn't
work in a sensible way in 2-node clusters), for instance, I would
strongly advise using a delay.

So I guess it is important to understand the basic idea behind
avoiding fence races through different delays.
Afterwards you can still decide why it is no issue in your own
setup.

>
>>> If the delay is set on both nodes, and they are different, it will work
>>> fine. The reason not to do this is that if you use 0, then don't use
>>> anything at all (0 is default), and any other value causes avoidable
>>> fence delays.
>>>
>>>>> Don't use a delay on the other node, just the node you want to live in
>>>>> such a case.
>>>>>
>>>>>>> *Given Node1 is active and Node2 goes down, does it mean fence1 will
>>>>>>> first execute and shutdowns Node1 even though Node2 goes down?
>>>>>> If Node2 managed to sign off properly it will not.
>>>>>> If network-connection is down so that Node2 can't inform Node1 that it
>>>>>> is going down and finally has stopped all resources it will be fenced
>>>>>> by Node1.
>>>>>>
>>>>>> Regards,
>>>>>> Klaus
>>>>> Fencing occurs in two cases;
>>>>>
>>>>> 1. The node stops responding (meaning it's in an unknown state, so it is
>>>>> fenced to force it into a known state).
>>>>> 2. A resource / service fails to stop. In this case, the service is
>>>>> in an unknown state, so the node is fenced to force the service into a
>>>>> known state so that it can be safely recovered on the peer.
>>>>>
>>>>> Graceful withdrawal of the node from the cluster, and graceful stopping
>>>>> of services will not lead to a fence (because in both cases, the node /
>>>>> service are in a known state - off).
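
To make the fence-race argument from this thread concrete, here is a
rough Python sketch of a 2-node race. It is only a toy model, not how
Pacemaker or any fence agent is implemented. The names FenceDevice,
latency and fence_race are made up for illustration. It assumes that the
delay on a fence device postpones fencing of that device's target, and
that a fence command, once sent, completes even if the sender is
powered off in the meantime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FenceDevice:
    """Hypothetical model of one fence device (not a Pacemaker object)."""
    target: str     # node this device powers off
    delay: float    # seconds to wait before the fence command is sent
    latency: float  # seconds from sending the command until the target is off


def fence_race(fence1: FenceDevice, fence2: FenceDevice) -> set:
    """Return the nodes that end up powered off after a split at t=0.

    fence1 targets node1 and is fired by node2; fence2 targets node2 and
    is fired by node1. Simplifying assumption: a command that has been
    sent always completes, even if its sender dies afterwards.
    """
    # Process the two shots in the order in which they would be sent.
    shots = sorted(
        [("node2", fence1), ("node1", fence2)],
        key=lambda shot: shot[1].delay,
    )
    dead_at = {}
    for shooter, device in shots:
        sent_at = device.delay
        # A node that is already off at send time cannot fire its shot.
        if shooter in dead_at and dead_at[shooter] <= sent_at:
            continue
        dead_at[device.target] = sent_at + device.latency
    return set(dead_at)


if __name__ == "__main__":
    # No delay anywhere, 2 s device latency: both shots are in flight.
    print(sorted(fence_race(FenceDevice("node1", 0, 2),
                            FenceDevice("node2", 0, 2))))
    # delay=15 only on the device that fences node1: node1 survives.
    print(sorted(fence_race(FenceDevice("node1", 15, 2),
                            FenceDevice("node2", 0, 2))))
    # Setup from the original question: delay=15 on fence1, 30 on fence2.
    print(sorted(fence_race(FenceDevice("node1", 15, 2),
                            FenceDevice("node2", 30, 2))))
```

Under these assumptions the first call reports both nodes down, the
second reports only node2, and the third reports only node1. The model
also shows why the difference between the delays has to be larger than
the time the fence device needs to act: with delays of 0 and 1 second
and a latency of 5 seconds, both nodes still go down.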
