>>> Ken Gaillot <[email protected]> wrote on 27.06.2018 at 16:32 in
>>> message <[email protected]>:
> On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
>> On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
>> > > > > Ken Gaillot <[email protected]> wrote on 26.06.2018 at
>> > > > > 18:22 in message <[email protected]>:
>> > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> > > > 26.06.2018 09:14, Ulrich Windl wrote:
>> > > > > Hi!
>> > > > >
>> > > > > We just observed a strange effect we cannot explain in SLES 11
>> > > > > SP4 (pacemaker 1.1.12-f47ea56): We run about a dozen Xen PVMs
>> > > > > on a three-node cluster (plus some infrastructure and
>> > > > > monitoring stuff). It has all worked well so far, and there
>> > > > > was no significant change recently. However, when a colleague
>> > > > > stopped one VM for maintenance via a cluster command, the
>> > > > > cluster did not notice when the PVM was actually running again
>> > > > > (it had been started without using the cluster; a bad idea, I
>> > > > > know).
>> > > >
>> > > > To be on the safe side in such cases you'd probably want to
>> > > > enable an additional monitor for the "Stopped" role. The default
>> > > > one covers only the "Started" role. It is the same as for
>> > > > multistate resources, where you need several monitor ops, for
>> > > > the "Started/Slave" and "Master" roles. But this will increase
>> > > > the load. Also, I believe the cluster should reprobe a resource
>> > > > on all nodes once you change target-role back to "Started".
>> > >
>> > > Which raises the question, how did you stop the VM initially?
>> >
>> > I thought "(...) stopped one VM for maintenance via cluster command"
>> > was obvious. It was something like "crm resource stop ...".
>> >
>> > > If you stopped it by setting target-role to Stopped, likely the
>> > > cluster still thinks it's stopped, and you need to set it to
>> > > Started again. If instead you set maintenance mode or unmanaged
>> > > the resource, then stopped the VM manually, then most likely it's
>> > > still in that mode and needs to be taken out of it.
>> >
>> > The point was that when the command to start the resource was given,
>> > the cluster completely ignored the fact that it was already running
>> > and proceeded to start the VM on a second node (which could be
>> > disastrous). But that's leading away from the main question...
>>
>> Ah, this is expected behavior when you start a resource manually and
>> there are no monitors with target-role=Stopped. If the node where you
>> manually started the VM isn't the same node the cluster happens to
>> choose, then you can get multiple active instances.
>>
>> By default, the cluster assumes that where a probe found a resource
>> to be not running, that resource will stay not running unless started
>> by the cluster. (It will re-probe if the node goes away and comes
>> back.)
>>
>> If you wish to guard against resources being started outside cluster
>> control, configure a recurring monitor with target-role=Stopped, and
>> the cluster will run it on all nodes where it thinks the resource is
>> not supposed to be running. Of course, since it has to poll at
>> intervals, it can take up to that much time to detect a manually
>> started instance.
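For illustration only, such a Stopped-role monitor might be configured
roughly like this in the crm shell (a sketch, not taken from the thread:
the resource name "vm_example", the xmfile path, and the intervals and
timeouts are placeholders; the second monitor op carries role="Stopped"
so it runs on nodes where the cluster believes the resource is not
running):

    # Hypothetical Xen resource with an extra monitor for the Stopped role.
    # The two monitor ops must use different intervals, because Pacemaker
    # identifies recurring operations by action name plus interval.
    primitive vm_example ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm/example" \
        op monitor interval="10min" timeout="60s" \
        op monitor interval="11min" timeout="60s" role="Stopped"

As noted above, detection of a manual start can then still take up to
the Stopped-role monitor's interval.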
>
> Alternatively, if you don't want the overhead of a recurring monitor
> but want to be able to address known manual starts yourself, you can
> force a full reprobe of the resource with "crm_resource -r
> <resource-id> --refresh".
>
> If you do it before starting the resource via crm, the cluster will
> stop the manually started instance, and then you can start it via the
> crm; if you do it after starting the resource via crm, there will
> still likely be two active instances, and the cluster will stop both
> and start one again.
>
> A way around that would be to unmanage the resource, start the
> resource via crm (which won't actually start anything due to being
> unmanaged, but will tell the cluster it's supposed to be started),
> force a reprobe, then manage the resource again -- that should prevent
> multiple active. However if the cluster prefers a different node, it
> may still stop the resource and start it in its preferred location.
> (Stickiness could get around that.)
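Spelled out as commands, that last workaround might look roughly as
follows (a sketch, assuming crmsh is the management shell as elsewhere
in the thread, and using the placeholder resource name "vm_example"):

    # Tell the cluster the resource should be running, without letting it act yet
    crm resource unmanage vm_example
    crm resource start vm_example        # only sets target-role=Started while unmanaged

    # Let the cluster rediscover where the VM is actually running
    crm_resource -r vm_example --refresh

    # Hand control back to the cluster
    crm resource manage vm_example

As mentioned, some resource-stickiness may still be needed so the
cluster does not relocate the VM to its otherwise preferred node.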
Hi!

Thanks again for that. There's one question that comes to mind: what is
the purpose of the cluster recheck interval? I thought it was exactly
that: finding resources that are not in the state they should be.

Regards,
Ulrich

>>
>> > > > > Examining the logs, it seems that the recheck timer popped
>> > > > > periodically, but no monitor action was run for the VM (the
>> > > > > action is configured to run every 10 minutes).
>>
>> Recurring monitors are only recorded in the log if their return value
>> changed. If there are 10 successful monitors in a row and then a
>> failure, only the first success and the failure are logged.
>>
>> > > > > Actually the only monitor operations found were:
>> > > > > May 23 08:04:13
>> > > > > Jun 13 08:13:03
>> > > > > Jun 25 09:29:04
>> > > > > Then a manual "reprobe" was done, and several monitor
>> > > > > operations were run. Then again I see no more monitor actions
>> > > > > in syslog.
>> > > > >
>> > > > > What could be the reasons for this? Too many operations
>> > > > > defined?
>> > > > >
>> > > > > The other message I don't understand is like
>> > > > > "<other-resource>: Rolling back scores from <vm-resource>"
>> > > > >
>> > > > > Could it be a new bug introduced in pacemaker, or could it be
>> > > > > some configuration problem? (The status is completely clean,
>> > > > > however.)
>> > > > >
>> > > > > According to the package changelog, there was no change since
>> > > > > Nov 2016...
>> > > > >
>> > > > > Regards,
>> > > > > Ulrich
> --
> Ken Gaillot <[email protected]>

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
