Can you share the cluster configuration (e.g., `pcs config` or the CIB)? And are there any additional LogAction messages after that one (e.g., Promote for node01)?
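
In case it helps with the second question, here is a minimal sketch for pulling every LogAction entry out of a pacemaker log, so that any later Promote/Demote decisions are easy to spot. It assumes the log lines look like the excerpt quoted below (unwrapped, one entry per line). The names `LOG_ACTION` and `find_log_actions` are made up for this illustration and are not part of any Pacemaker tooling.

```python
import re

# Assumed format, based on the quoted excerpt (one entry per line):
# "... pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )"
LOG_ACTION = re.compile(
    r"LogAction:\s*\*\s*(?P<action>\w+)\s+(?P<resource>\S+)\s*\((?P<detail>[^)]*)\)"
)


def find_log_actions(lines):
    """Return (action, resource, detail) tuples for each LogAction line found."""
    results = []
    for line in lines:
        match = LOG_ACTION.search(line)
        if match:
            # Collapse the padding that the scheduler puts inside the parentheses.
            detail = " ".join(match.group("detail").split())
            results.append((match.group("action"), match.group("resource"), detail))
    return results


# Two lines taken from the quoted excerpt, rejoined into single entries.
sample = [
    "Jan 18 16:52:17 [21938] node02.example.com crmd: info: "
    "update_dc: Set DC to node02.example.com (3.0.14)",
    "Jan 18 16:52:19 [21937] node02.example.com pengine: notice: "
    "LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )",
]

for action, resource, detail in find_log_actions(sample):
    print(action, resource, detail)
```

An open file handle can be passed in place of `sample`; the log location varies by distribution and version, so the path has to be supplied for the system at hand.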
On Mon, Jan 18, 2021 at 7:47 PM Stuart Massey <[email protected]> wrote:

> So, we have a 2-node cluster with a quorum device. One of the nodes
> (node1) is having some trouble, so we have added constraints to prevent any
> resources migrating to it, but have not put it in standby, so that drbd in
> secondary on that node stays in sync. The problems it is having lead to OS
> lockups that eventually resolve themselves - but that causes it to be
> temporarily dropped from the cluster by the current master (node2).
> Sometimes when node1 rejoins, then node2 will demote the drbd ms resource.
> That causes all resources that depend on it to be stopped, leading to a
> service outage. They are then restarted on node2, since they can't run on
> node1 (due to constraints).
> We are having a hard time understanding why this happens. It seems like
> there may be some sort of DC contention happening. Does anyone have any
> idea how we might prevent this from happening?
> Selected messages (de-identified) from pacemaker.log that illustrate
> suspicion re DC confusion are below. The update_dc and
> abort_transition_graph re deletion of lrm seem to always precede the
> demotion, and a demotion seems to always follow (when not already demoted).
>
> Jan 18 16:52:17 [21938] node02.example.com crmd: info:
> do_dc_takeover: Taking over DC status for this partition
> Jan 18 16:52:17 [21938] node02.example.com crmd: info:
> update_dc: Set DC to node02.example.com (3.0.14)
> Jan 18 16:52:17 [21938] node02.example.com crmd: info:
> abort_transition_graph: Transition aborted by deletion of
> lrm[@id='1']: Resource state removal | cib=0.89.327
> source=abort_unless_down:357
> path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
> Jan 18 16:52:19 [21937] node02.example.com pengine: info:
> master_color: ms_drbd_ourApp: Promoted 0 instances of a possible 1 to
> master
> Jan 18 16:52:19 [21937] node02.example.com pengine: notice:
> LogAction: * Demote drbd_ourApp:1 ( Master -> Slave
> node02.example.com )
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>

--
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
