Background:
There are 4 physical machines, each running two virtual machines. The
lustre-mds-nodexx VMs run the Lustre MDS service and the lustre-oss-nodexx VMs
run the Lustre OSS service. Each virtual machine is directly connected to two
network interfaces, service1 (ens6f0np0) and service2 (ens6f1np1). Pacemaker is
used to provide high availability for the Lustre services.
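Fencing is handled through fence_virt/fence_xvm on the hypervisors (fence_virtd
appears in the logs below). As a rough sketch only, this kind of stonith device
is usually defined along the following lines; the device name and the
host-to-domain mapping here are placeholders, not our exact configuration:

pcs stonith create fence-mds fence_xvm \
    pcmk_host_map="lustre-mds-node40:mds40;lustre-mds-node41:mds41;lustre-mds-node42:mds42;lustre-mds-node32:mds32" \
    op monitor interval=60s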
Software versions:
Lustre: 2.15.5
Corosync: 3.1.5
Pacemaker: 2.1.0-8.el8
PCS: 0.10.8
Operation:
During testing, the network interfaces service1 and service2 on
lustre-oss-node40 and lustre-mds-node40 were repeatedly taken down for 20
seconds and brought back up for 30 seconds (to simulate a flapping network):

for i in {1..10}; do
    date; ifconfig ens6f0np0 down && ifconfig ens6f1np1 down; sleep 20
    date; ifconfig ens6f0np0 up && ifconfig ens6f1np1 up; date; sleep 30
done
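(Not part of the original procedure, but while the interfaces are flapping, the
knet link state and the membership can be watched from one of the surviving
nodes with standard tools, for example:

corosync-cfgtool -s    # per-link status as seen by the local corosync
crm_mon -1             # one-shot snapshot of membership and resources)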
Issue:
Since only the interfaces on the node40 VMs were taken down, lustre-oss-node40
and lustre-mds-node40 are the nodes that should have been fenced, but
lustre-mds-node32 was fenced instead.
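(As a cross-check, the fencing events can also be inspected from a surviving
node, e.g. with "pcs stonith history show lustre-mds-node32" or
"stonith_admin --history lustre-mds-node32"; the excerpts below are taken from
the logs instead.)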
Related Logs:
Jun 09 17:54:51 node32 fence_virtd[2502]: Destroying domain 60e80c07-107e-4e8a-ba42-39e48b3e6bb7
(This line shows lustre-mds-node32 being fenced.)
* turning off of lustre-mds-node32 successful: delegate=lustre-mds-node42, client=pacemaker-controld.8918, origin=lustre-mds-node42, completed='2025-06-09 17:54:54.527116 +08:00'
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] link:
Resetting MTU for link 0 because host 1 joined
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] host: host:
1 (passive) best link: 0 (pri: 1)
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] link: host:
1 link: 0 is down
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] link: host:
1 link: 1 is down
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync warning [KNET ] host: host:
1 has no active links
Jun 09 17:54:36 [1429] lustre-mds-node32 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:57:44 [1419] lustre-mds-node32 corosync notice [MAIN ] Corosync
Cluster Engine 3.1.8 starting up
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
4 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
3 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
2 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
4 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
3 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host:
2 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host:
4 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host:
3 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host:
2 has no active links
Jun 09 17:54:37 [1412] lustre-mds-node40 corosync notice [TOTEM ] A processor
failed, forming new configuration: token timed out (11300ms), waiting 13560ms
for consensus.
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] link:
Resetting MTU for link 1 because host 3 joined
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] link:
Resetting MTU for link 1 because host 2 joined
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] host: host:
2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync
members[3]: 1 2 3
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync
left[1]: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] A new
membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] Failed to
receive the leave message. failed: 4
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] link: host:
1 link: 0 is down
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] link: host:
1 link: 1 is down
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync warning [KNET ] host: host:
1 has no active links
Jun 09 17:54:36 [8913] lustre-mds-node41 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:54:39 [8913] lustre-mds-node41 corosync notice [TOTEM ] A processor
failed, forming new configuration: token timed out (11300ms), waiting 13560ms
for consensus.
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] rx: host: 1
link: 1 is up
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] link:
Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync
members[3]: 1 2 3
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync
left[1]: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] A new
membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] Failed to
receive the leave message. failed: 4
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] link: host:
1 link: 0 is down
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] link: host:
1 link: 1 is down
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync warning [KNET ] host: host:
1 has no active links
Jun 09 17:54:36 [8900] lustre-mds-node42 corosync notice [TOTEM ] Token has
not been received in 8475 ms
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] rx: host: 1
link: 1 is up
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] link:
Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] host: host:
1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] pmtud:
Global data MTU changed to: 1397
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync
members[3]: 1 2 3
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync
left[1]: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] A new
membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] Failed to
receive the leave message. failed: 4
/etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: mds_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: 11f2c4097ac44d5981769a9ed579c99e
    token: 10000
}

nodelist {
    node {
        ring0_addr: 10.255.153.240
        ring1_addr: 10.255.153.241
        name: lustre-mds-node40
        nodeid: 1
    }
    node {
        ring0_addr: 10.255.153.244
        ring1_addr: 10.255.153.245
        name: lustre-mds-node41
        nodeid: 2
    }
    node {
        ring0_addr: 10.255.153.248
        ring1_addr: 10.255.153.249
        name: lustre-mds-node42
        nodeid: 3
    }
    node {
        ring0_addr: 10.255.153.236
        ring1_addr: 10.255.153.237
        name: lustre-mds-node32
        nodeid: 4
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}