Root cause:
1) When corosync is restarted, it may take up to a minute to finish setting up.
2) The systemd timeout value is exceeded.
   Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: Failed to start Corosync Cluster Engine.
   Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Unit entered failed state.
   Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Failed with result 'timeout'.
3) Pacemaker is then started. The pacemaker systemd script has a dependency on corosync, which may still be in the process of coming up.
4) Pacemaker fails to start due to the dependency.
   Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
5) Pacemaker remains down.
6) Subsequently, the charm checks pacemaker health by running `crm node list` in a loop until it succeeds.
7) Because pacemaker never comes up, this is an infinite loop.

Solutions:
1) Adding corosync to this bug for a systemd script timeout change.
2) The charm needs to better validate service restarts and better communicate to the end user when an error has occurred.

Current work in progress:
https://review.openstack.org/#/c/419204/

** Also affects: corosync (Ubuntu)
   Importance: Undecided
       Status: New

--
https://bugs.launchpad.net/bugs/1654403

Title:
  Race condition in hacluster charm that leaves pacemaker down
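
A minimal sketch of how the health check in step 6 could be bounded instead of looping forever, in line with solution 2. This is not the charm's actual code, nor the change under review. `wait_for_pacemaker`, `crm_node_list_ok`, `PacemakerNotReady`, the 300-second timeout, and the 5-second interval are hypothetical names and values chosen for illustration. Only the `crm node list` command comes from the report.

```python
import subprocess
import time


class PacemakerNotReady(Exception):
    """Raised when pacemaker does not answer before the deadline (hypothetical)."""


def crm_node_list_ok():
    """Return True if `crm node list` exits with status 0."""
    try:
        result = subprocess.run(
            ["crm", "node", "list"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The crm binary is missing or cannot be executed.
        return False
    return result.returncode == 0


def wait_for_pacemaker(check=None, timeout=300.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it succeeds or `timeout` seconds elapse.

    Returns the number of attempts made. Raises PacemakerNotReady on
    timeout so the caller can report the failure instead of hanging.
    The timeout and interval defaults are assumptions, not charm values.
    """
    if check is None:
        check = crm_node_list_ok
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() >= deadline:
            raise PacemakerNotReady(
                "pacemaker not responding after %d attempts" % attempts
            )
        sleep(interval)
```

A caller could catch `PacemakerNotReady` and surface the error to the operator rather than leaving the hook stuck in the loop.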