>>> Klaus Wenninger <[email protected]> wrote on 23.05.2022 at 19:43 in message
<CALrDAo18avdvPCfWGAG=_Pzc1zeqj7UhYL9T1JvZOG=eztm...@mail.gmail.com>:
> On Fri, May 20, 2022 at 7:43 AM Ulrich Windl
> <[email protected]> wrote:
>>
>> >>> Jan Friesse <[email protected]> wrote on 19.05.2022 at 14:55 in message
>> <[email protected]>:
>> > Hi,
>> >
>> > On 19/05/2022 10:16, Leditzky, Fabian via Users wrote:
>> >> Hello
>> >>
>> >> We have been dealing with our pacemaker/corosync clusters becoming unstable.
>> >> The OS is Debian 10 and we use the Debian packages for pacemaker and corosync,
>> >> versions 3.0.1-5+deb10u1 and 3.0.1-2+deb10u1 respectively.
>> >
>> > Seems like the pcmk version is not so important for the behavior you've
>> > described. Corosync 3.0.1 is super old, are you able to reproduce the
>>
>> I'm running corosync-2.4.5-12.7.1.x86_64 (SLES15 SP3) here ;-)
>>
>> Are you mixing up "super old" with "super buggy"?
>
> Actually 3.0.1 is older than 2.4.5, and on top of that 2.4.5 is the head of a
> mature branch while 3.0.1 is the beginning of a new branch that brought
> substantial changes.
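For readers trying to follow the version discussion: a quick way to check which corosync and knet versions a Debian node actually has installed (the package name libknet1 is an assumption based on typical Debian packaging) is:

    corosync -v                 # print the running corosync version
    dpkg -s corosync libknet1 | grep -E '^(Package|Version)'

Honza's suggestion elsewhere in the thread is to get onto a current knet (1.23 at the time of writing) and a newer corosync 3.x before chasing the membership problem further.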
Klaus, thanks for explaining. Wasn't aware of that.

Regards,
Ulrich

>
> Klaus
>>
>> Regards,
>> Ulrich
>>
>> > behavior with 3.1.6? What is the version of knet? There were quite a few
>> > fixes, so the last one (1.23) is really recommended.
>> >
>> > You can try to compile it yourself, or use the proxmox repo
>> > (http://download.proxmox.com/debian/pve/), which contains newer versions
>> > of the packages.
>> >
>> >> We use knet over UDP transport.
>> >>
>> >> We run multiple 2-node and 4-8 node clusters, primarily managing VIP
>> >> resources.
>> >> The issue we experience presents itself as a spontaneous disagreement about
>> >> the status of cluster members. In two-node clusters, each node spontaneously
>> >> sees the other node as offline, despite network connectivity being OK.
>> >> In larger clusters, the status can be inconsistent across the nodes.
>> >> E.g.: node 1 sees 2 and 4 as offline, node 2 sees 1 and 4 as offline, while
>> >> nodes 3 and 4 see every node as online.
>> >
>> > This really shouldn't happen.
>> >
>> >> The cluster becomes generally unresponsive to resource actions in this
>> >> state.
>> >
>> > Expected
>> >
>> >> Thus far we have been unable to restore cluster health without restarting
>> >> corosync.
>> >>
>> >> We are running packet captures 24/7 on the clusters and have custom tooling
>> >> to detect lost UDP packets on knet ports. So far we could not see significant
>> >> packet loss trigger an event; at most we have seen a single UDP packet dropped
>> >> some seconds before the cluster fails.
>> >>
>> >> However, even if the root cause is indeed a flaky network, we do not understand
>> >> why the cluster cannot recover on its own in any way. The issues definitely
>> >> persist beyond the presence of any intermittent network problem.
>> >
>> > Try a newer version. If the problem persists, it's a good idea to monitor
>> > whether packets are really getting through. Corosync always creates (at
>> > least) a single-node membership.
>> >
>> > Regards,
>> > Honza
>> >
>> >> We were able to artificially break clusters by inducing packet loss with an
>> >> iptables rule.
>> >> Dropping packets on a single node of an 8-node cluster can cause malfunctions
>> >> on multiple other cluster nodes. The expected behavior would be detecting
>> >> that the artificially broken node failed but keeping the rest of the cluster
>> >> stable.
>> >> We were able to reproduce this also on Debian 11 with more recent
>> >> corosync/pacemaker versions.
>> >>
>> >> Our configuration is basic; we do not significantly deviate from the defaults.
>> >>
>> >> We will be very grateful for any insights into this problem.
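The exact fault-injection rule is not shown in the thread; as a rough sketch (assuming corosync's default knet UDP port 5405), random inbound packet loss on one node can be induced with something like:

    # drop roughly 20% of inbound corosync/knet traffic on the default port
    iptables -A INPUT -p udp --dport 5405 -m statistic --mode random --probability 0.2 -j DROP

    # delete the same rule again afterwards
    iptables -D INPUT -p udp --dport 5405 -m statistic --mode random --probability 0.2 -j DROP

With such a rule on a single node, the expectation described above is that only that node is declared lost while the remaining nodes keep a consistent membership.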
>> >>
>> >> Thanks,
>> >> Fabian
>> >>
>> >> // corosync.conf
>> >> totem {
>> >>     version: 2
>> >>     cluster_name: cluster01
>> >>     crypto_cipher: aes256
>> >>     crypto_hash: sha512
>> >>     transport: knet
>> >> }
>> >> logging {
>> >>     fileline: off
>> >>     to_stderr: no
>> >>     to_logfile: no
>> >>     to_syslog: yes
>> >>     debug: off
>> >>     timestamp: on
>> >>     logger_subsys {
>> >>         subsys: QUORUM
>> >>         debug: off
>> >>     }
>> >> }
>> >> quorum {
>> >>     provider: corosync_votequorum
>> >>     two_node: 1
>> >>     expected_votes: 2
>> >> }
>> >> nodelist {
>> >>     node {
>> >>         name: node01
>> >>         nodeid: 01
>> >>         ring0_addr: 10.0.0.10
>> >>     }
>> >>     node {
>> >>         name: node02
>> >>         nodeid: 02
>> >>         ring0_addr: 10.0.0.11
>> >>     }
>> >> }
>> >>
>> >> // crm config show
>> >> node 1: node01 \
>> >>     attributes standby=off
>> >> node 2: node02 \
>> >>     attributes standby=off maintenance=off
>> >> primitive IP-clusterC1 IPaddr2 \
>> >>     params ip=10.0.0.20 nic=eth0 cidr_netmask=24 \
>> >>     meta migration-threshold=2 target-role=Started is-managed=true \
>> >>     op monitor interval=20 timeout=60 on-fail=restart
>> >> primitive IP-clusterC2 IPaddr2 \
>> >>     params ip=10.0.0.21 nic=eth0 cidr_netmask=24 \
>> >>     meta migration-threshold=2 target-role=Started is-managed=true \
>> >>     op monitor interval=20 timeout=60 on-fail=restart
>> >> location STICKY-IP-clusterC1 IP-clusterC1 100: node01
>> >> location STICKY-IP-clusterC2 IP-clusterC2 100: node02
>> >> property cib-bootstrap-options: \
>> >>     have-watchdog=false \
>> >>     dc-version=2.0.1-9e909a5bdd \
>> >>     cluster-infrastructure=corosync \
>> >>     cluster-name=cluster01 \
>> >>     stonith-enabled=no \
>> >>     no-quorum-policy=ignore \
>> >>     last-lrm-refresh=1632230917

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
