Background:
There are 4 physical machines, each running two virtual machines. The
lustre-mds-nodexx VMs run the Lustre MDS service and the lustre-oss-nodexx VMs
run the Lustre OSS service. Each virtual machine is directly connected to two
network interfaces, service1 (ens6f0np0) and service2 (ens6f1np1). Pacemaker is
used to provide high availability for the Lustre services.
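Fencing is handled through fence_virt/fence_xvm on the hypervisors (fence_virtd
appears in the logs below). As a rough sketch only, this kind of stonith device
is usually defined along the following lines; the device name and the
host-to-domain mapping here are placeholders, not our exact configuration:

pcs stonith create fence-mds fence_xvm \
    pcmk_host_map="lustre-mds-node40:mds40;lustre-mds-node41:mds41;lustre-mds-node42:mds42;lustre-mds-node32:mds32" \
    op monitor interval=60s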
Software versions:
Lustre: 2.15.5
Corosync: 3.1.5
Pacemaker: 2.1.0-8.el8
PCS: 0.10.8
Operation:
During testing, the network interfaces service1 and service2 on
lustre-oss-node40 and lustre-mds-node40 were repeatedly taken down for 20
seconds and brought back up for 30 seconds (to simulate a flapping network):

for i in {1..10}; do
    date; ifconfig ens6f0np0 down && ifconfig ens6f1np1 down; sleep 20
    date; ifconfig ens6f0np0 up && ifconfig ens6f1np1 up; date; sleep 30
done
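(Not part of the original procedure, but while the interfaces are flapping, the
knet link state and the membership can be watched from one of the surviving
nodes with standard tools, for example:

corosync-cfgtool -s    # per-link status as seen by the local corosync
crm_mon -1             # one-shot snapshot of membership and resources)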
Issue:
Since only the interfaces on the node40 VMs were taken down, lustre-oss-node40
and lustre-mds-node40 are the nodes that should have been fenced, but
lustre-mds-node32 was fenced instead.
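(As a cross-check, the fencing events can also be inspected from a surviving
node, e.g. with "pcs stonith history show lustre-mds-node32" or
"stonith_admin --history lustre-mds-node32"; the excerpts below are taken from
the logs instead.)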
Related Logs:
Jun 09 17:54:51 node32 fence_virtd[2502]: Destroying domain 60e80c07-107e-4e8a-ba42-39e48b3e6bb7
(This line shows lustre-mds-node32 being fenced.)
* turning off of lustre-mds-node32 successful: delegate=lustre-mds-node42, client=pacemaker-controld.8918, origin=lustre-mds-node42, completed='2025-06-09 17:54:54.527116 +08:00'
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] link:
Resetting MTU for link 0 because host 1 joined
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] host: host:
1 (passive) best link: 0 (pri: 1)
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] link: host:
1 link: 0 is down
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] link: host:
1 link: 1 is down
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync warning [KNET ] host: host:
1 has no active links
Jun 09 17:54:36 [1429] lustre-mds-node32 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:57:44 [1419] lustre-mds-node32 corosync notice [MAIN ] Corosync
Cluster Engine 3.1.8 starting up
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
4 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
3 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
2 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
4 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
3 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
2 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host:
4 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host:
3 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host:
2 has no active links
Jun 09 17:54:37 [1412] lustre-mds-node40 corosync notice [TOTEM ] A processor
failed, forming new configuration: token timed out (11300ms), waiting 13560ms
for consensus.
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] link:
Resetting MTU for link 1 because host 3 joined
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] link:
Resetting MTU for link 1 because host 2 joined
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync
members[3]: 1 2 3
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync
left[1]: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] A new
membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] Failed to
receive the leave message. failed: 4
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] link: host:
1 link: 0 is down
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] link: host:
1 link: 1 is down
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync warning [KNET ] host: host:
1 has no active links
Jun 09 17:54:36 [8913] lustre-mds-node41 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:54:39 [8913] lustre-mds-node41 corosync notice [TOTEM ] A processor
failed, forming new configuration: token timed out (11300ms), waiting 13560ms
for consensus.
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] rx: host: 1
link: 1 is up
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] link:
Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync
members[3]: 1 2 3
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync
left[1]: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] A new
membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] Failed to
receive the leave message. failed: 4
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] link: host:
1 link: 0 is down
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] link: host:
1 link: 1 is down
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync warning [KNET ] host: host:
1 has no active links
Jun 09 17:54:36 [8900] lustre-mds-node42 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] rx: host: 1
link: 1 is up
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] link:
Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync
members[3]: 1 2 3
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync
left[1]: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] A new
membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] Failed to
receive the leave message. failed: 4
/etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: mds_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: 11f2c4097ac44d5981769a9ed579c99e
    token: 10000
}

nodelist {
    node {
        ring0_addr: 10.255.153.240
        ring1_addr: 10.255.153.241
        name: lustre-mds-node40
        nodeid: 1
    }
    node {
        ring0_addr: 10.255.153.244
        ring1_addr: 10.255.153.245
        name: lustre-mds-node41
        nodeid: 2
    }
    node {
        ring0_addr: 10.255.153.248
        ring1_addr: 10.255.153.249
        name: lustre-mds-node42
        nodeid: 3
    }
    node {
        ring0_addr: 10.255.153.236
        ring1_addr: 10.255.153.237
        name: lustre-mds-node32
        nodeid: 4
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}