On Thu, May 24, 2018 at 10:40 AM, Ken Gaillot <[email protected]> wrote:
> On Thu, 2018-05-24 at 16:14 +0200, Klaus Wenninger wrote:
>> On 05/24/2018 04:03 PM, Ken Gaillot wrote:
>>> On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>>>> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov <[email protected]> wrote:
>>>>> 24.05.2018 02:57, Jason Gauthier wrote:
>>>>>> I'm fairly new to clustering under Linux. I basically have one
>>>>>> shared storage resource right now, using dlm and gfs2.
>>>>>> I'm using fibre channel, and when both of my nodes are up (2-node
>>>>>> cluster) dlm and gfs2 seem to be operating perfectly.
>>>>>> If I reboot node B, node A works fine, and vice versa.
>>>>>>
>>>>>> When node B goes offline unexpectedly and becomes unclean, dlm
>>>>>> seems to block all IO to the shared storage.
>>>>>>
>>>>>> dlm knows node B is down:
>>>>>>
>>>>>> # dlm_tool status
>>>>>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>>>>>> daemon now 865695 fence_pid 18186
>>>>>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
>>>>>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>>>>>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>>>>>
>>>>>> On the same server, I see these messages in my daemon.log:
>>>>>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
>>>>>> (reboot) node 1084772369/(null) : No route to host (-113)
>>>>>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
>>>>>>
>>>>>> I can recover from the situation by forcing it (or bringing the
>>>>>> other node back online):
>>>>>> dlm_tool fence_ack 1084772369
>>>>>>
>>>>>> The cluster config is pretty straightforward:
>>>>>> node 1084772368: alpha
>>>>>> node 1084772369: beta
>>>>>> primitive p_dlm_controld ocf:pacemaker:controld \
>>>>>>         op monitor interval=60 timeout=60 \
>>>>>>         meta target-role=Started \
>>>>>>         params args="-K -L -s 1"
>>>>>> primitive p_fs_gfs2 Filesystem \
>>>>>>         params device="/dev/sdb2" directory="/vms" fstype=gfs2
>>>>>> primitive stonith_sbd stonith:external/sbd \
>>>>>>         params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>>>>>>         meta target-role=Started
>>>>>
>>>>> What is the status of the stonith resource? Did you configure SBD
>>>>> fencing properly?
>>>>
>>>> I believe so. It's shown above in my cluster config.
>>>>
>>>>> Is the sbd daemon up and running with proper parameters?
>>>>
>>>> Well, no, apparently sbd isn't running. With dlm and gfs2, the
>>>> cluster handles launching the daemons.
>>>> I assumed the same here, since the resource shows that it is up.
>>>
>>> Unlike other services, sbd must be up before the cluster starts in
>>> order for the cluster to use it properly. (Notice the
>>> "have-watchdog=false" in your cib-bootstrap-options ... that means
>>> the cluster didn't find sbd running.)
>>>
>>> Also, even storage-based sbd requires a working hardware watchdog
>>> for the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
>>> should list the watchdog device.
>>> Also sbd_device in your cluster config should match SBD_DEVICE in
>>> /etc/sysconfig/sbd.
>>>
>>> If you want the cluster to recover services elsewhere after a node
>>> self-fences (which I'm sure you do), you also need to set the
>>> stonith-watchdog-timeout cluster property to something greater than
>>> the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster
>>> will wait that long and then assume the node fenced itself.

Thanks. So, for whatever reason, sbd was not running. I went ahead and
got /etc/default/sbd (Debian) configured. I can't start the service
manually due to dependencies, but I rebooted node B and it came up.
Node A would not, so I ended up rebooting both nodes at the same time,
and sbd was running on both. I forced a failure of node B, and after a
few seconds node A was able to access the shared storage. Definite
improvement!
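For anyone finding this thread later, a minimal sketch of what
/etc/default/sbd might look like for a setup like this -- the shared-disk
path matches the stonith_sbd resource above, but the watchdog device and
timeout are assumptions rather than values confirmed in this thread:

  # /etc/default/sbd (sketch only -- adjust for your hardware)
  SBD_DEVICE="/dev/sdb1"            # must match sbd_device in the stonith_sbd resource
  SBD_WATCHDOG_DEV="/dev/watchdog"  # assumption: whichever hardware watchdog the node exposes
  SBD_WATCHDOG_TIMEOUT=5            # assumption: example value only
  SBD_PACEMAKER=yes

With that in place and sbd started before corosync/pacemaker, the cluster
should report have-watchdog=true instead of false.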
>> Actually, for the case that there is a shared disk, a successful
>> fencing attempt via the sbd fencing resource should be enough for
>> the node to be assumed down.
>> In case of a 2-node setup I would even discourage setting
>> stonith-watchdog-timeout, as we need a real quorum mechanism for
>> that to work.
>
> Ah, thanks -- I've updated the wiki how-to, feel free to clarify
> further:
>
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker
>
>> Regards,
>> Klaus
>
>>>> Online: [ alpha beta ]
>>>>
>>>> Full list of resources:
>>>>
>>>>  stonith_sbd    (stonith:external/sbd): Started alpha
>>>>  Clone Set: cl_gfs2 [g_gfs2]
>>>>      Started: [ alpha beta ]
>>>>
>>>>> What is the output of
>>>>> sbd -d /dev/sdb1 dump
>>>>> sbd -d /dev/sdb1 list
>>>>
>>>> Both nodes seem fine.
>>>>
>>>> 0       alpha   test    beta
>>>> 1       beta    test    alpha
>>>>
>>>>> on both nodes? Does
>>>>>
>>>>> sbd -d /dev/sdb1 message <other-node> test
>>>>>
>>>>> work in both directions?
>>>>
>>>> It doesn't return an error, yet without a daemon running, I don't
>>>> think the message is received either.
>>>>
>>>>> Does manual fencing using stonith_admin work?
>>>>
>>>> I'm not sure at the moment. I think I need to look into why the
>>>> daemon isn't running.
>>>>
>>>>>> group g_gfs2 p_dlm_controld p_fs_gfs2
>>>>>> clone cl_gfs2 g_gfs2 \
>>>>>>         meta interleave=true target-role=Started
>>>>>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>>>>>> property cib-bootstrap-options: \
>>>>>>         have-watchdog=false \
>>>>>>         dc-version=1.1.16-94ff4df \
>>>>>>         cluster-infrastructure=corosync \
>>>>>>         cluster-name=zeta \
>>>>>>         last-lrm-refresh=1525523370 \
>>>>>>         stonith-enabled=true \
>>>>>>         stonith-timeout=20s
>>>>>>
>>>>>> Any pointers would be appreciated. I feel like this should be
>>>>>> working, but I'm not sure if I've missed something.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jason
>
> --
> Ken Gaillot <[email protected]>
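One closing note for the archives: once sbd is actually running on both
nodes, the fence path can be exercised end to end with the commands
already mentioned in this thread, roughly along these lines (node names
and the device path are from this particular setup; treat it as a
sketch, not a verified procedure):

  # confirm the sbd header and both node slots on the shared disk
  sbd -d /dev/sdb1 dump
  sbd -d /dev/sdb1 list

  # from alpha, write a test message into beta's slot (and the reverse from beta)
  sbd -d /dev/sdb1 message beta test

  # ask the cluster to fence the peer for real -- this should reboot it
  stonith_admin --reboot beta

  # if dlm is still blocked on an old fence request, acknowledge it manually
  dlm_tool fence_ack 1084772369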
_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
