Re: [ClusterLabs] Issue with DRBD + a systemd resource

Andrei Borzenkov Wed, 13 Dec 2017 21:16:38 -0800


Sent from iPhone


> On 13 Dec 2017, at 22:53, Julien Semaan <[email protected]> wrote:
> 
> Hello,
> 
> It's my first post on this mailing list, so please excuse any rookie mistakes 
> I may make in this thread.
> 
> We currently have clusters deployed using corosync/pacemaker that manage DRBD 
> + a couple of systemd services.
> 
> My colleague Derek previously emailed the list about it but has left the 
> company since then:
> http://lists.clusterlabs.org/pipermail/users/2017-November/006796.html
> 
> I'm hoping to continue his work in order to fix it once and for all.
> 
> I looked into the Q&A that was done in that thread and have managed to track 
> it down to the following:
> - If I reboot the server that is running as the primary (DRBD + systemd 
> resources started), then once the reboot completes, there is a split-brain
> - If I stop pacemaker (systemctl stop pacemaker), then reboot that primary 
> server, then it comes back online without any issues and no split-brain
> - If I reboot the server that doesn't have the running resources, all goes 
> well
> 
> Following those observations, my guess is that the way the pacemaker 
> services are stopped during a systemd shutdown is causing the issue.
> It seems that pacemaker isn't stopping the systemd resources in that case and 
> thus isn't unmounting the DRBD partition and demoting it to secondary before 
> DRBD is stopped, which results in the split-brain.
> 


According to your log, D-Bus is stopped before pacemaker. Try adding an After= 
dependency on the dbus service to the pacemaker unit.
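A minimal sketch of such a dependency as a systemd drop-in, assuming the units 
are named pacemaker.service and dbus.service on your distribution (check with 
"systemctl list-units"); the drop-in file name is only an illustration:

```ini
# /etc/systemd/system/pacemaker.service.d/dbus-ordering.conf
# (hypothetical file name; any *.conf in this directory is read)
[Unit]
# Start pacemaker after D-Bus. systemd reverses the ordering at shutdown,
# so pacemaker should be stopped before D-Bus goes away.
After=dbus.service
```

After creating the file, run "systemctl daemon-reload" and check the result 
with "systemctl show -p After pacemaker.service"; dbus.service should appear 
in the list.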



> Here is the interesting bit I found in the logs:
> Dec 13 14:09:40 act-pass-2 lrmd[1133]:    error: Could not connect to System 
> DBus: Did not receive a reply. Possible causes include: the remote 
> application did not send a reply, the message bus security policy 
> blocked the reply, the reply timeout expired, or the network connection was 
> broken.
> Dec 13 14:09:40 act-pass-2 lrmd[1133]:    error: systemd_unit_exec: Triggered 
> fatal assert at systemd.c:730 : systemd_init()
> Dec 13 14:09:40 act-pass-2 pacemakerd[1083]:    error: Managed process 1133 
> (lrmd) dumped core
> Dec 13 14:09:40 act-pass-2 pacemakerd[1083]:    error: The lrmd process 
> (1133) terminated with signal 6 (core=1)
> 
> And a pastebin of the full journald output during the shutdown
> https://pastebin.com/CB38BiwC
> 
> I'm not sure where to go from here. It may be a dependency on another systemd 
> resource, but it looks more like a problem connecting to systemd itself to 
> stop the cluster's systemd resources (that's a wild guess), since systemd 
> isn't accepting commands while it is stopping. At this point, this goes beyond 
> my knowledge of systemd, so I'd like some guidance on any required adjustment 
> or further necessary troubleshooting.
> 
> Best Regards,
> 
> -- 
> Julien Semaan
> [email protected]  ::  +1 (866) 353-6153 *155  ::  www.inverse.ca
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence 
> (www.packetfence.org) 
> _______________________________________________
> Users mailing list: [email protected]
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
