Hi,
I have configured clusters of node pairs, so each cluster has 2 nodes. The
cluster members are statically defined in corosync.conf before corosync or
pacemaker is started, and quorum {two_node: 1} is set.
When both nodes are powered off and I power them on, they do not start
pacemaker at exactly the same time. The time difference may be a few minutes
depending on other factors outside the nodes.
My goals are (I call the first node to start pacemaker "node1"):
1) I want to control how long pacemaker on node1 waits before fencing node2 if
node2 does not start pacemaker.
2) If node1 is part-way through that waiting period, and node2 starts pacemaker
so they detect each other, I would like them to proceed immediately to probing
resource state and starting resources which are down, not wait until the end of
that "grace period".
It looks from the documentation like dc-deadtime is how #1 is controlled, and
#2 is expected normal behavior. However, I'm seeing fence actions before
dc-deadtime has passed.
Am I misunderstanding Pacemaker's expected behavior and/or how dc-deadtime
should be used?
One possibly unusual aspect of this cluster is that these two nodes are
stateless - they PXE boot from an image on another server - and I build the
cluster configuration at boot time with a series of pcs commands, because the
nodes have no local storage for this purpose. The commands are:
['pcs', 'cluster', 'start']
['pcs', 'property', 'set', 'stonith-action=off']
['pcs', 'property', 'set', 'cluster-recheck-interval=60']
['pcs', 'property', 'set', 'start-failure-is-fatal=false']
['pcs', 'property', 'set', 'dc-deadtime=300']
['pcs', 'stonith', 'create', 'fence_gopher11', 'fence_powerman',
'ip=192.168.64.65', 'pcmk_host_check=static-list',
'pcmk_host_list=gopher11,gopher12']
['pcs', 'stonith', 'create', 'fence_gopher12', 'fence_powerman',
'ip=192.168.64.65', 'pcmk_host_check=static-list',
'pcmk_host_list=gopher11,gopher12']
['pcs', 'resource', 'create', 'gopher11_zpool', 'ocf:llnl:zpool',
'import_options="-f -N -d /dev/disk/by-vdev"', 'pool=gopher11', 'op', 'start',
'timeout=805']
...
['pcs', 'property', 'set', 'no-quorum-policy=ignore']
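A minimal sketch of how a command list like the one above could be run in order from Python. The helper name run_pcs_commands and its injectable runner parameter are hypothetical, introduced only for illustration; only a subset of the commands is repeated here.

```python
import subprocess


def run_pcs_commands(commands, runner=subprocess.run):
    """Run each command in order and return the list of commands that ran.

    `runner` is a hypothetical injection point so the sequence can be
    dry-run on a machine without pcs installed.
    """
    completed = []
    for cmd in commands:
        # With subprocess.run, check=True raises CalledProcessError on a
        # non-zero exit status, so later commands are not attempted.
        runner(cmd, check=True)
        completed.append(cmd)
    return completed


# Subset of the commands listed above, in the same order.
COMMANDS = [
    ["pcs", "cluster", "start"],
    ["pcs", "property", "set", "stonith-action=off"],
    ["pcs", "property", "set", "dc-deadtime=300"],
    ["pcs", "property", "set", "no-quorum-policy=ignore"],
]

if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    run_pcs_commands(COMMANDS, runner=lambda cmd, check: print(" ".join(cmd)))
```

Because every command is applied to the live cluster one at a time, Pacemaker sees each property and resource as a separate CIB change while it is already running.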
I could, instead, generate a CIB so that when Pacemaker is started, it has a
full config. Is that better?
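A rough sketch of what that alternative might look like, assuming pcs's -f option (apply a command to a CIB file instead of the live CIB) and "pcs cluster cib-push" (load a CIB file into the cluster). The function name build_offline_plan and the file path are hypothetical, and the sketch assumes the starting CIB file already exists, for example one saved with "pcs cluster cib".

```python
def build_offline_plan(commands, cib_path):
    """Rewrite live pcs commands so they edit a CIB file, then push it once.

    Assumption: `cib_path` already holds a valid starting CIB.
    """
    plan = []
    for cmd in commands:
        if cmd[:2] == ["pcs", "cluster"]:
            # Cluster lifecycle commands (such as "pcs cluster start") act on
            # the running cluster, not on a CIB file, so they are left out.
            continue
        # Insert "-f <file>" right after "pcs" so the change goes to the file.
        plan.append(["pcs", "-f", cib_path] + cmd[1:])
    # One push at the end delivers the whole configuration in a single update.
    plan.append(["pcs", "cluster", "cib-push", cib_path])
    return plan


if __name__ == "__main__":
    live = [
        ["pcs", "cluster", "start"],
        ["pcs", "property", "set", "dc-deadtime=300"],
        ["pcs", "property", "set", "no-quorum-policy=ignore"],
    ]
    for step in build_offline_plan(live, "/tmp/cib.xml"):
        print(" ".join(step))
```

The difference from the command-by-command approach is that Pacemaker would receive the properties, fence devices, and resources together rather than as a series of separate configuration changes.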
thanks,
Olaf
=== corosync.conf:
totem {
    version: 2
    cluster_name: gopher11
    secauth: off
    transport: udpu
}
nodelist {
    node {
        ring0_addr: gopher11
        name: gopher11
        nodeid: 1
    }
    node {
        ring0_addr: gopher12
        name: gopher12
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
=== Log excerpt
Here's an excerpt from the Pacemaker logs that reflects what I'm seeing. These
lines are from gopher12, the node that came up first. The other node, which is
not yet up, is gopher11.
Jan 25 17:55:38 gopher12 pacemakerd [116033] (main) notice: Starting Pacemaker 2.1.7-1.t4 | build=2.1.7 features:agent-manpages ascii-docs compat-2.0 corosync-ge-2 default-concurrent-fencing generated-manpages monotonic nagios ncurses remote systemd
Jan 25 17:55:39 gopher12 pacemaker-controld [116040] (peer_update_callback) info: Cluster node gopher12 is now member (was in unknown state)
Jan 25 17:55:43 gopher12 pacemaker-based [116035] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair id="cib-bootstrap-options-dc-deadtime" name="dc-deadtime" value="300"/>
Jan 25 17:56:00 gopher12 pacemaker-controld [116040] (crm_timer_popped) info: Election Trigger just popped | input=I_DC_TIMEOUT time=300000ms
Jan 25 17:56:01 gopher12 pacemaker-based [116035] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
Jan 25 17:56:01 gopher12 pacemaker-controld [116040] (abort_transition_graph) info: Transition 0 aborted by cib-bootstrap-options-no-quorum-policy doing create no-quorum-policy=ignore: Configuration change | cib=0.26.0 source=te_update_diff_v2:464 path=/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options'] complete=true
Jan 25 17:56:01 gopher12 pacemaker-controld [116040] (controld_execute_fence_action) notice: Requesting fencing (off) targeting node gopher11 | action=11 timeout=60
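
To make the timing concrete, here is a small sketch that computes the gaps between the timestamps in the excerpt above and compares them with the configured dc-deadtime of 300 seconds. The helper name seconds_between is hypothetical, and the sketch assumes all timestamps fall on the same day, as they do in this excerpt.

```python
from datetime import datetime

TIME_FMT = "%H:%M:%S"


def seconds_between(start, end):
    """Return whole seconds from `start` to `end` (same-day HH:MM:SS strings)."""
    delta = datetime.strptime(end, TIME_FMT) - datetime.strptime(start, TIME_FMT)
    return int(delta.total_seconds())


# Timestamps copied from the log excerpt above.
pacemaker_started = "17:55:38"
dc_deadtime_set = "17:55:43"
election_trigger_popped = "17:56:00"
fencing_requested = "17:56:01"

DC_DEADTIME_SECONDS = 300  # value passed to "pcs property set dc-deadtime=300"

print(seconds_between(pacemaker_started, election_trigger_popped))  # 22
print(seconds_between(dc_deadtime_set, election_trigger_popped))    # 17
print(seconds_between(pacemaker_started, fencing_requested))        # 23
print(seconds_between(pacemaker_started, fencing_requested) < DC_DEADTIME_SECONDS)  # True
```

So the Election Trigger popped about 22 seconds after Pacemaker started, and fencing was requested about 23 seconds after startup, even though the log line for the timer reports time=300000ms.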