Sent from iPhone
> On 13 Dec 2017, at 22:53, Julien Semaan <[email protected]> wrote:
>
> Hello,
>
> It's my first post on this mailing list so excuse any rookie mistakes I may make
> in this thread.
>
> We currently have clusters deployed using corosync/pacemaker that manage DRBD
> + a couple of systemd services.
>
> My colleague Derek previously emailed the list about it but has left the
> company since then:
> http://lists.clusterlabs.org/pipermail/users/2017-November/006796.html
>
> I'm hoping to continue his work in order to fix it once and for all.
>
> I looked into the Q&A that was done in that thread and have managed to track
> it down to the following:
> - If I reboot the server that is running as the primary (DRBD + systemd
> resources started), then when it completes reboot, there is a split-brain
> - If I stop pacemaker (systemctl stop pacemaker), then reboot that primary
> server, then it comes back online without any issues and no split-brain
> - If I reboot the server that doesn't have the running resources, all goes
> well
>
> Following those observations, my guess is that the way the pacemaker
> services are being stopped during a systemd shutdown is causing issues.
> It seems that pacemaker isn't stopping the systemd resources in that case and
> thus, not un-mounting the DRBD partition, putting it in secondary before
> stopping DRBD which results in the split-brain.
>

According to your log, D-Bus is stopped before pacemaker. Try adding an After= dependency on the dbus service to the pacemaker unit.

> Here is the interesting bit I found in the logs:
> Dec 13 14:09:40 act-pass-2 lrmd[1133]: error: Could not connect to System
> DBus: Did not receive a reply. Possible causes include: the remote
> application did not send a reply, the message bus security policy
> blocked the reply, the reply timeout expired, or the network connection was
> broken.
> Dec 13 14:09:40 act-pass-2 lrmd[1133]: error: systemd_unit_exec: Triggered
> fatal assert at systemd.c:730 : systemd_init()
> Dec 13 14:09:40 act-pass-2 pacemakerd[1083]: error: Managed process 1133
> (lrmd) dumped core
> Dec 13 14:09:40 act-pass-2 pacemakerd[1083]: error: The lrmd process
> (1133) terminated with signal 6 (core=1)
>
> And a pastebin of the full journald output during the shutdown
> https://pastebin.com/CB38BiwC
>
> Not sure where to go from there, may be a dependency to another systemd
> resource but it seems more like an issue connecting to systemd itself to stop
> the systemd resources of the cluster (that's a wild guess) since systemd
> isn't accepting commands since it's stopping. At this point, this goes beyond
> my knowledge of systemd so I'd like some guidance on any required adjustment
> or further necessary troubleshooting.
>
> Best Regards,
>
> --
> Julien Semaan
> [email protected] :: +1 (866) 353-6153 *155 :: www.inverse.ca
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence
> (www.packetfence.org)
> _______________________________________________
> Users mailing list: [email protected]
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
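
The After= suggestion in the reply could be sketched as a systemd drop-in for the pacemaker unit. This is only a sketch under assumptions: the D-Bus unit is assumed to be named dbus.service on this distribution, and the drop-in file name is hypothetical; neither is confirmed by the posted log.

```ini
# /etc/systemd/system/pacemaker.service.d/dbus.conf
# (hypothetical drop-in path; "systemctl edit pacemaker" creates an
# equivalent override file)
[Unit]
# Start pacemaker after D-Bus. systemd reverses ordering at shutdown,
# so pacemaker would be stopped before dbus.service (assumed unit name)
# and lrmd could still reach the system bus to stop the systemd resources.
After=dbus.service
```

After adding the file, run "systemctl daemon-reload" so systemd picks up the change, then repeat the reboot test on the primary node.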
