Hello, I currently have Pacemaker v2.0.3-3ubuntu4.2 running on two Ubuntu 20.04 LTS systems. My config consists of two service groups, each of which has an LSB resource and a floating IP resource. The LSB resource is configured with a monitor operation, so that "/etc/init.d/<lsb-resource-name> status" is run at 30-second intervals. The "status" portion of the script only returns a healthy exit code when it determines that the PID behind a PID file is active. I have also set an 'rsc_location' constraint so that the service group for VIP A prefers node A, and the group for VIP B prefers node B; ideally, with both nodes active and healthy, VIP A will always be running on node A, and VIP B on node B.
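For concreteness, here is a minimal crmsh sketch of that layout. Every name, address, and score here is a placeholder, as is the 60s 'failure-timeout' mentioned below; the real (obfuscated) config is in the pastebin linked further down:

    primitive svc-a lsb:service-a \
        op monitor interval=30s \
        meta failure-timeout=60s
    primitive vip-a ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=30s
    group grp-a svc-a vip-a
    location loc-grp-a grp-a 100: node-a
    # ...plus the mirror image: svc-b and vip-b in grp-b, preferring node-b.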
The problem I'm having is that if I intentionally shut down the service that my "/etc/init.d/<lsb-resource-name> status" script is checking against, I get the following behavior:

- I shut down the backing service on node B.
- Pacemaker performs a status check, which returns a bad result.
- Pacemaker then correctly migrates the VIP and the LSB resource for the now-'offline' service group from node B to node A.
- Pacemaker's 'failure-timeout' interval expires.
- Pacemaker shuts down the VIP B service group on node A.
- Pacemaker attempts to start the VIP B service group on node B, which fails.
- Pacemaker starts the VIP B service group on node A.
- Pacemaker's 'failure-timeout' interval expires.
- Pacemaker shuts down the VIP B service group on node A.
- Pacemaker attempts to start the VIP B service group on node B, which fails.
- Pacemaker starts the VIP B service group on node A.
- ...and so on.

What I would LIKE to happen is for Pacemaker to attempt to run a "status" on node B PRIOR to stopping the service group on node A and attempting to start the service group on node B. Something like this:

- Pacemaker's 'failure-timeout' interval expires.
- Pacemaker checks the status of the LSB service ("/etc/init.d/<lsb-resource-name> status"), which returns a bad exit code.
- Pacemaker's 'failure-timeout' interval expires.
- Pacemaker checks the status of the LSB service ("/etc/init.d/<lsb-resource-name> status"), which returns a bad exit code.

At that point an administrator or an automated script could intervene and bring the backing service online, after which we would have this behavior:

- Pacemaker's 'failure-timeout' interval expires.
- Pacemaker checks the status of the LSB service ("/etc/init.d/<lsb-resource-name> status"), which returns a HEALTHY exit code.
- Pacemaker shuts down the VIP B service group on node A.
- Pacemaker starts the VIP B service group on node B.

I have attached an obfuscated pastebin of my current Pacemaker configuration, as well as a copy of the logs for the pacemaker service, capturing both the initial failure and the repetitive failed attempts to start the LSB resource.

Obfuscated "crm configure show": https://pastebin.com/emAw8juQ
Obfuscated "journalctl -fu pacemaker": https://pastebin.com/kcnfCrjf

Please let me know if there is a configuration parameter I can place in my config that would tell Pacemaker to perform a status check on the LSB resource PRIOR to attempting to start the service group on its preferred node.

--
Michael Romero
Lead Infrastructure Engineer
Engineering | Convoso
562-338-9868
[email protected]
www.convoso.com
LinkedIn: https://linkedin.com/in/romerom
