Root cause:

1) When corosync is restarted it may take up to a minute for it to
finish setting up.

2) The systemd timeout value is exceeded.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Unit entered failed state.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Failed with result 'timeout'.

3) Pacemaker is then started. The pacemaker systemd unit has a dependency
on corosync, which may still be in the process of coming up.

4) Pacemaker fails to start because its corosync dependency failed.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.

5) Pacemaker remains down.

6) Subsequently, the charm checks for pacemaker health by running `crm
node list` in a loop until it succeeds.

7) Because pacemaker remains down, the command never succeeds and the
loop never terminates.
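
One way to avoid the endless wait in steps 6 and 7 is to bound the loop with a deadline. The following is only a sketch under stated assumptions, not the charm's actual code: the names wait_for_pacemaker, crm_node_list_ok and PacemakerNotReady, and the default timeout and interval values, are hypothetical. Only the `crm node list` command comes from the report above.

```python
import subprocess
import time


class PacemakerNotReady(Exception):
    """Hypothetical error: pacemaker did not answer before the deadline."""


def crm_node_list_ok():
    """Return True if `crm node list` exits with status 0."""
    try:
        result = subprocess.run(
            ["crm", "node", "list"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The crm binary is missing or not executable: treat as not ready.
        return False
    return result.returncode == 0


def wait_for_pacemaker(check=crm_node_list_ok, timeout=300, interval=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `check` until it succeeds or `timeout` seconds have passed.

    The timeout and interval defaults are illustrative assumptions.
    `check`, `sleep` and `clock` are injectable so the loop can be
    exercised without a real cluster.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return
        if clock() >= deadline:
            raise PacemakerNotReady(
                "pacemaker not responding after %s seconds" % timeout
            )
        sleep(interval)
```

With a bound in place the caller can catch the exception and report the failure instead of hanging forever.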


Solutions

1) Adding corosync to this bug for a systemd script timeout change.

2) The charm needs to better validate that the services restarted
successfully and better communicate to the end user when an error has
occurred.
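
As a rough illustration of solution 2, the charm could check that both units are actually active after a restart and turn any failure into a message for the operator. This is a sketch only, not the change under review: the function names and message strings are hypothetical, and the "active"/"blocked" pair is assumed to be passed to whatever status mechanism the charm uses. `systemctl is-active --quiet` exits 0 when the unit is active.

```python
import subprocess


def unit_is_active(unit):
    """Return True if `systemctl is-active --quiet <unit>` exits 0."""
    return subprocess.call(["systemctl", "is-active", "--quiet", unit]) == 0


def failed_units(units=("corosync", "pacemaker"), is_active=unit_is_active):
    """Return the units that are not active, in the order given."""
    return [unit for unit in units if not is_active(unit)]


def restart_status_message(failed):
    """Build a (status, message) pair describing the restart outcome.

    The status names and message wording are assumptions made for this
    example.
    """
    if not failed:
        return ("active", "corosync and pacemaker are running")
    return ("blocked",
            "Services not running after restart: " + ", ".join(failed))
```

For example, restart_status_message(failed_units()) could be evaluated after the restart and surfaced to the user rather than waiting silently.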


Current Work in Progress
https://review.openstack.org/#/c/419204/


** Also affects: corosync (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1654403

Title:
  Race condition in hacluster charm that leaves pacemaker down
