Hmmm… my RHEL 8.8 OS has been hardened.
I am wondering whether the problem comes from that.
On the other hand, I get the same issue (i.e. corosync not restarted by systemd)
with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).
I'm checking.
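One quick way to check whether the hardening touched the unit definitions (a sketch; these are just the standard systemd inspection commands):
[~]$ systemctl cat corosync.service pacemaker.service
[~]$ systemd-delta --type=extended,overridden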
From: Users <[email protected]> On behalf of NOLIBOS Christophe via Users
Sent: Thursday, April 18, 2024 18:34
To: Klaus Wenninger <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
Cc: NOLIBOS Christophe <[email protected]>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
So, the issue is with systemd?
If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
1.1.13-10, corosync is correctly restarted by systemd.
[RHEL7 ~]# journalctl -f
-- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
Apr 18 16:26:55 - systemd[1]: corosync.service failed.
Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling
restart.
Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync):
[ OK ]
Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster
Manager.
Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster
Manager...
Apr 18 16:26:55 - pacemakerd[12192]: notice: Additional logging available in
/var/log/pacemaker.log
Apr 18 16:26:55 - pacemakerd[12192]: notice: Switching to
/var/log/cluster/corosync.log
Apr 18 16:26:55 - pacemakerd[12192]: notice: Additional logging available in
/var/log/cluster/corosync.log
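Note what the RHEL 7 journal above shows: it appears to be pacemaker.service being rescheduled (the "holdoff time over" line) that pulls corosync back up through its dependency on corosync.service. The restart and dependency settings can be compared on both systems like this (a sketch using standard systemctl properties):
[~]$ systemctl show pacemaker -p Restart -p Requires -p After
[~]$ systemctl show corosync -p Restart
On RHEL 8, pacemakerd keeps running after the corosync kill (see the status output further down), so presumably a pacemaker.service restart is never scheduled.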
From: Klaus Wenninger <[email protected]>
Sent: Thursday, April 18, 2024 18:12
To: NOLIBOS Christophe <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <[email protected]> wrote:
On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <[email protected]> wrote:
Well… why do you say "if corosync isn't there then this is to be expected, and pacemaker won't recover corosync"?
In my mind, Corosync is managed by Pacemaker like any other cluster resource, and
the "pacemakerd: recover properly from Corosync crash" fix implemented in
version 2.1.2 seems to confirm that.
Nope. Startup of the stack is done by systemd, and pacemaker is just started
after corosync is up;
systemd should be responsible for keeping the stack up.
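If you want systemd itself to bring corosync back after a crash, a drop-in along these lines should work (a sketch, not the packaged default; the path and values are illustrative):
# /etc/systemd/system/corosync.service.d/99-restart.conf
[Service]
Restart=on-failure
RestartSec=1s
followed by a systemctl daemon-reload.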
For completeness: if you have sbd in the mix, it is started by systemd as well,
kind of in parallel with corosync and as part of it (PartOf, in systemd
terminology).
The "recover" above is referring to pacemaker recovering from corosync going
away and coming back.
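In other words, the 2.1.2 fix makes pacemakerd survive the outage and reconnect once corosync is back; something (systemd or an operator) still has to start corosync again. A quick way to see that behavior (a sketch, on a node you can disturb):
[~]$ pkill -9 corosync           # simulate the crash
[~]$ systemctl start corosync    # bring corosync back by hand
[~]$ journalctl -u pacemaker -f  # pacemakerd should reconnect rather than need a restart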
Klaus
From: NOLIBOS Christophe
Sent: Thursday, April 18, 2024 17:56
To: 'Klaus Wenninger' <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
Cc: Ken Gaillot <[email protected]>
Subject: RE: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
[~]$ systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor
preset: disabled)
Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited,
status=0/SUCCESS)
Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS
(code=killed, signal=KILL)
Main PID: 1324906 (code=killed, signal=KILL)
Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Sync joined[1]: 1
Apr 18 13:16:04 - corosync[1324906]: [TOTEM ] A new membership (1.1c8) was
formed. Members joined: 1
Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster
members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster
members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]: [VOTEQ ] Waiting for all cluster
members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]: [QUORUM] Members[1]: 1
Apr 18 13:16:04 - corosync[1324906]: [MAIN ] Completed service
synchronization, ready to provide service.
Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited,
code=killed, status=9/KILL
Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
[~]$
From: Klaus Wenninger <[email protected]>
Sent: Thursday, April 18, 2024 17:43
To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
Cc: Ken Gaillot <[email protected]>; NOLIBOS Christophe <[email protected]>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <[email protected]> wrote:
I'm using Red Hat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
When I kill Corosync, no new corosync process is created and pacemaker is left
in a failed state.
The only workaround is to restart the pacemaker service, as shown after the status output below.
[~]$ pcs status
Error: unable to get cib
[~]$
[~]$ systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor
preset: disabled)
Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
Docs: man:pacemakerd
https://clusterlabs.org/pacemaker/doc/
Main PID: 1324923 (pacemakerd)
Tasks: 91
Memory: 132.1M
CGroup: /system.slice/pacemaker.service
...
Apr 18 14:59:02 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:03 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:04 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:05 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:06 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:07 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:08 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:09 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:10 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
Apr 18 14:59:11 - pacemakerd[1324923]: crit: Could not connect to Corosync
CFG: CS_ERR_LIBRARY
[~]$
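(For reference, the recovery used here is simply:
[~]$ systemctl restart pacemaker
which, as far as I can tell, pulls corosync back up through pacemaker.service's dependency on corosync.service.)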
Well, if corosync isn't there then this is to be expected, and pacemaker won't
recover corosync.
Can you check what systemd thinks about corosync (status/journal)?
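E.g. (exact output will vary):
[~]$ systemctl status corosync
[~]$ journalctl -u corosync -b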
Klaus
-----Original Message-----
From: Ken Gaillot <[email protected]>
Sent: Thursday, April 18, 2024 16:40
To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
Cc: NOLIBOS Christophe <[email protected]>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix
What OS are you using? Does it use systemd?
What happens when you kill Corosync?
On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
>
> Dear All,
>
> I have a question about the "pacemakerd: recover properly from
> Corosync crash" fix implemented in version 2.1.2.
> I have observed the issue when testing pacemaker version 2.0.5, just
> by killing the ‘corosync’ process: Corosync was not recovered.
>
> I am using now pacemaker version 2.1.5-8.
> Doing the same test, I have the same result: Corosync is still not
> recovered.
>
> Please confirm that the "pacemakerd: recover properly from Corosync
> crash" fix implemented in version 2.1.2 covers this scenario.
> If it does, did I miss something in the configuration of my cluster?
>
> Best Regards.
>
> Christophe.
>
>
>
--
Ken Gaillot <[email protected]>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
