Re: [ClusterLabs] Cloned ressource is restarted on all nodes if one node fails

Strahil Nikolov via Users Mon, 09 Aug 2021 04:41:03 -0700

I once set up something similar using a VIP cloned with globally-unique=true,
so that the VIP is present on every node and the cluster controls which node
is active and which is passive. The VIP is configured everywhere, but only one
node answers the requests, while the web server runs on every node with its
config and data on a shared FS.
Sadly, I can't find my notes right now.
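
A minimal sketch of what such a globally-unique VIP clone might look like in
the CIB, assuming the IPaddr2 agent's CLUSTERIP mode. The ids, the
clusterip_hash value, and the clone counts are illustrative assumptions and are
not taken from the notes mentioned above. As far as I understand the agent, in
this mode each node answers a share of the connections selected by the hash,
and it depends on the iptables CLUSTERIP target being available on the nodes.

```xml
<!-- Sketch only: ids and values are assumptions, not the original setup -->
<clone id="vip-clone">
  <meta_attributes id="vip-clone-meta_attributes">
    <!-- globally-unique=true makes every clone instance a distinct entity -->
    <nvpair id="vip-clone-meta_attributes-globally-unique"
            name="globally-unique" value="true"/>
    <nvpair id="vip-clone-meta_attributes-clone-max"
            name="clone-max" value="2"/>
    <!-- allow both instances on one node so a survivor can hold both -->
    <nvpair id="vip-clone-meta_attributes-clone-node-max"
            name="clone-node-max" value="2"/>
  </meta_attributes>
  <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
    <instance_attributes id="vip-instance_attributes">
      <nvpair id="vip-instance_attributes-ip" name="ip"
              value="{{infrastructure.virtual_ip}}"/>
      <!-- assumed hash mode; it decides which node answers which client -->
      <nvpair id="vip-instance_attributes-clusterip_hash"
              name="clusterip_hash" value="sourceip"/>
    </instance_attributes>
    <operations>
      <op id="vip-monitor-interval-10s" interval="10s" name="monitor"
          timeout="20s"/>
    </operations>
  </primitive>
</clone>
```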
Best Regards,
Strahil Nikolov
 
 
On Mon, Aug 9, 2021 at 13:43, Andreas Janning <[email protected]> wrote:

Hi all,
we recently experienced an outage in our Pacemaker cluster, and I would like
to understand how we can configure the cluster to avoid this problem in the
future.

First, our basic setup:
- CentOS 7
- Pacemaker 1.1.23
- Corosync 2.4.5
- Resource-Agents 4.1.1

Our cluster is composed of multiple active/passive nodes. Each software
component runs on two nodes simultaneously, and all traffic is routed to the
active node via a virtual IP. If the active node fails, the passive node grabs
the virtual IP and immediately takes over all work of the failed node. Since
the software is already up and running on the passive node, there should be
virtually no downtime. We have tried to achieve this in Pacemaker by
configuring clone sets for each software component.

Now the problem: When a software component fails on the active node, the
virtual IP is correctly grabbed by the passive node. BUT the software
component is also immediately restarted on the passive node. That
unfortunately defeats the purpose of the whole setup, since we now have
downtime until the software component is restarted on the passive node, and
the restart might even fail and lead to a complete outage.

After some investigation, I now understand that the cloned resource is
restarted on all nodes after a monitoring failure because the default
"on-fail" of "monitor" is restart. But that is not what I want.
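
For reference, on-fail is an attribute of the individual operation in the CIB.
The fragment below is the apache monitor op from the reproduction setup with
the default value written out explicitly. It changes nothing by itself and
only shows where the setting lives. The other documented values include
ignore, block, stop, standby, and fence. Whether any of them produces the
desired behaviour is not established here.

```xml
<!-- Same monitor op as in the reproduction setup, default on-fail spelled out -->
<op id="apache-monitor-interval-10s" interval="10s" name="monitor"
    timeout="20s" on-fail="restart"/>
```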
I have created a minimal setup that reproduces the problem:


<configuration>
 <crm_config>
 <cluster_property_set id="cib-bootstrap-options">
 <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" 
value="false"/>
 <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
value="1.1.23-1.el7_9.1-9acf116022"/>
 <nvpair id="cib-bootstrap-options-cluster-infrastructure" 
name="cluster-infrastructure" value="corosync"/>
 <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" 
value="pacemaker-test"/>
 <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" 
value="false"/>
 <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" 
value="false"/>
 </cluster_property_set>
 </crm_config>
 <nodes>
 <node id="1" uname="active-node"/>
 <node id="2" uname="passive-node"/>
 </nodes>
 <resources>
 <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
 <instance_attributes id="vip-instance_attributes">
 <nvpair id="vip-instance_attributes-ip" name="ip" 
value="{{infrastructure.virtual_ip}}"/>
 </instance_attributes>
 <operations>
 <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" 
timeout="20s"/>
 <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
 <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
 </operations>
 </primitive>
 <clone id="apache-clone">
 <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
 <instance_attributes id="apache-instance_attributes">
 <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
 <nvpair id="apache-instance_attributes-statusurl" name="statusurl" 
value="http://localhost/server-status"/>
 </instance_attributes>
 <operations>
 <op id="apache-monitor-interval-10s" interval="10s" name="monitor" 
timeout="20s"/>
 <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
 <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
 </operations>
 </primitive>
 <meta_attributes id="apache-meta_attributes">
 <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" 
value="2"/>
 <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" 
value="1"/>
 </meta_attributes>
 </clone>
 </resources>
 <constraints>
 <rsc_location id="location-apache-clone-active-node-100" node="active-node" 
rsc="apache-clone" score="100" resource-discovery="exclusive"/>
 <rsc_location id="location-apache-clone-passive-node-0" node="passive-node" 
rsc="apache-clone" score="0" resource-discovery="exclusive"/>
 <rsc_location id="location-vip-clone-active-node-100" node="active-node" 
rsc="vip" score="100" resource-discovery="exclusive"/>
 <rsc_location id="location-vip-clone-passive-node-0" node="passive-node" 
rsc="vip" score="0" resource-discovery="exclusive"/>
 <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip" 
score="INFINITY" with-rsc="apache-clone"/>
 </constraints>
 <rsc_defaults>
 <meta_attributes id="rsc_defaults-options">
 <nvpair id="rsc_defaults-options-resource-stickiness" 
name="resource-stickiness" value="50"/>
 </meta_attributes>
 </rsc_defaults>
</configuration>



When this configuration is started, httpd will be running on active-node and
passive-node. The VIP runs only on active-node. When I crash httpd on
active-node (with killall httpd), passive-node immediately grabs the VIP and
restarts its own httpd.

How can I change this configuration so that when the resource fails on
active-node:
- passive-node immediately grabs the VIP (as it does now).
- active-node tries to restart the failed resource, giving up after x
  attempts.
- passive-node does NOT restart the resource.
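
The "giving up after x attempts" part is normally expressed with the
migration-threshold meta attribute. A minimal sketch, assuming it is added to
the existing apache-clone meta attributes. The value 3 stands in for "x", and
the failure-timeout entry is an optional assumption rather than something the
setup above requires. This covers only the second bullet and is not claimed to
stop the restart on passive-node.

```xml
<meta_attributes id="apache-meta_attributes">
  <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max"
          value="2"/>
  <nvpair id="apache-clone-meta_attributes-clone-node-max"
          name="clone-node-max" value="1"/>
  <!-- assumption: after 3 failures the node is no longer used for apache -->
  <nvpair id="apache-clone-meta_attributes-migration-threshold"
          name="migration-threshold" value="3"/>
  <!-- optional assumption: expire recorded failures after 10 minutes -->
  <nvpair id="apache-clone-meta_attributes-failure-timeout"
          name="failure-timeout" value="600s"/>
</meta_attributes>
```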
Regards
Andreas Janning



 _______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
