Re: [ClusterLabs] Cluster Stopped, No Messages?

Eric Robinson Tue, 01 Jun 2021 09:21:15 -0700

> -----Original Message-----
> From: Users <[email protected]> On Behalf Of Klaus
> Wenninger
> Sent: Monday, May 31, 2021 12:54 AM
> To: [email protected]
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> On 5/29/21 12:21 AM, Strahil Nikolov wrote:
> > I agree -> fencing is mandatory.
> Agreed that with proper fencing setup the cluster wouldn'thave run into that
> state.
> But still it might be interesting to find out what has happened.

Thank you for looking past the fencing issue to the real question. Regardless 
of whether or not fencing was enabled, there should still be some indication of 
what actions the cluster took and why, but it appears that cluster services 
just terminated silently.

>Not seeing anything in the log snippet either.

Me neither.

> Assuming you are running something systemd-based. Centos 7.

Yes. CentOS Linux release 7.5.1804.

> Did you check the journal for pacemaker to see what systemd is thinking?
> With the standard unit-file systemd should observe pacemakerd and restart
> it if it goes away ungracefully.

The only log entry showing Pacemaker startup that I found in any of the 
messages files (current and several days of history) was the one when I started 
the cluster manually (see below).

> You should be able to test this behavior sending a SIGKILL to pacemakerd.
> pacemakerd in turn watches out for signals from the sub-daemons it has
> spawned (I'm currently working on more in-depth observation here.).
> So just disappearing shouldn't happen that easily.

Agreed. From looking at the cluster log, it appears that it just stopped making 
log entries at 9:40. Then at 11:49, I found the cluster service stopped and 
started them...

May 27 09:25:31 [92171] 001store01a       crmd:     info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
May 27 09:25:31 [92171] 001store01a       crmd:     info: do_te_invoke: 
Processing graph 91482 (ref=pe_calc-dc-1622121931-124396) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:25:31 [92171] 001store01a       crmd:   notice: run_graph:    
Transition 91482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:25:31 [92171] 001store01a       crmd:     info: do_log:       Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
May 27 09:25:31 [92171] 001store01a       crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd
May 27 09:40:31 [92171] 001store01a       crmd:     info: crm_timer_popped:     
PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
May 27 09:40:31 [92171] 001store01a       crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_TIMER_POPPED origin=crm_timer_popped
May 27 09:40:31 [92171] 001store01a       crmd:     info: do_state_transition:  
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 27 09:40:31 [92170] 001store01a    pengine:     info: process_pe_message:   
Input has not changed since last time, not saving to disk
May 27 09:40:31 [92170] 001store01a    pengine:     info: 
determine_online_status:      Node 001store01a is online
May 27 09:40:31 [92170] 001store01a    pengine:     info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-itls active on 001store01a
May 27 09:40:31 [92170] 001store01a    pengine:  warning: 
unpack_rsc_op_failure:        Processing failed op monitor for p_vip_ftpclust01 
on 001store01a: unknown error (1)
May 27 09:40:31 [92170] 001store01a    pengine:     info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-etls active on 001store01a
May 27 09:40:31 [92170] 001store01a    pengine:     info: unpack_node_loop:     
Node 1 is already processed
May 27 09:40:31 [92170] 001store01a    pengine:     info: unpack_node_loop:     
Node 1 is already processed
May 27 09:40:31 [92170] 001store01a    pengine:     info: common_print: 
p_vip_ftpclust01        (ocf::heartbeat:IPaddr2):       Started 001store01a
May 27 09:40:31 [92170] 001store01a    pengine:     info: common_print: 
p_replicator    (systemd:pure-replicator):      Started 001store01a
May 27 09:40:31 [92170] 001store01a    pengine:     info: common_print: 
p_pure-ftpd-etls        (systemd:pure-ftpd-etls):       Started 001store01a
May 27 09:40:31 [92170] 001store01a    pengine:     info: common_print: 
p_pure-ftpd-itls        (systemd:pure-ftpd-itls):       Started 001store01a
May 27 09:40:31 [92170] 001store01a    pengine:     info: LogActions:   Leave   
p_vip_ftpclust01        (Started 001store01a)
May 27 09:40:31 [92170] 001store01a    pengine:     info: LogActions:   Leave   
p_replicator    (Started 001store01a)
May 27 09:40:31 [92170] 001store01a    pengine:     info: LogActions:   Leave   
p_pure-ftpd-etls        (Started 001store01a)
May 27 09:40:31 [92170] 001store01a    pengine:     info: LogActions:   Leave   
p_pure-ftpd-itls        (Started 001store01a)
May 27 09:40:31 [92170] 001store01a    pengine:   notice: process_pe_message:   
Calculated transition 91483, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:40:31 [92171] 001store01a       crmd:     info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
May 27 09:40:31 [92171] 001store01a       crmd:     info: do_te_invoke: 
Processing graph 91483 (ref=pe_calc-dc-1622122831-124397) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:40:31 [92171] 001store01a       crmd:   notice: run_graph:    
Transition 91483 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:40:31 [92171] 001store01a       crmd:     info: do_log:       Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
May 27 09:40:31 [92171] 001store01a       crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd
[10667] 001store01a.ccnva.local corosyncnotice  [MAIN  ] Corosync Cluster 
Engine ('2.4.3'): started and ready to provide service.
[10667] 001store01a.ccnva.local corosyncinfo    [MAIN  ] Corosync built-in 
features: dbus systemd xmlconf qdevices qnetd snmp libcgroup pie relro bindnow
[10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] Initializing transport 
(UDP/IP Unicast).
[10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: none hash: none
[10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] The network interface 
[10.51.14.40] is now up.
[10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service engine loaded: 
corosync configuration map access [0]
[10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server name: cmap
[10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service engine loaded: 
corosync configuration service [1]
[10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server name: cfg
[10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service engine loaded: 
corosync cluster closed process group service v1.01 [2]
[10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server name: cpg
[10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service engine loaded: 
corosync profile loading service [4]
[10667] 001store01a.ccnva.local corosyncnotice  [QUORUM] Using quorum provider 
corosync_votequorum
[10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for all 
cluster members. Current votes: 1 expected_votes: 2
[10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service engine loaded: 
corosync vote quorum service v1.0 [5]
[10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server name: votequorum
[10667] 001store01a.ccnva.local corosyncnotice  [SERV  ] Service engine loaded: 
corosync cluster quorum service v0.1 [3]
[10667] 001store01a.ccnva.local corosyncinfo    [QB    ] server name: quorum
[10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] adding new UDPU member 
{10.51.14.40}
[10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] adding new UDPU member 
{10.51.14.41}
[10667] 001store01a.ccnva.local corosyncnotice  [TOTEM ] A new membership 
(10.51.14.40:6412) was formed. Members joined: 1
[10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for all 
cluster members. Current votes: 1 expected_votes: 2
[10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for all 
cluster members. Current votes: 1 expected_votes: 2
[10667] 001store01a.ccnva.local corosyncnotice  [VOTEQ ] Waiting for all 
cluster members. Current votes: 1 expected_votes: 2
[10667] 001store01a.ccnva.local corosyncnotice  [QUORUM] Members[1]: 1
[10667] 001store01a.ccnva.local corosyncnotice  [MAIN  ] Completed service 
synchronization, ready to provide service.
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:   notice: main:     
Starting Pacemaker 1.1.18-11.el7_5.3 | build=2b07d5c5a9 features: 
generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd 
nagios  corosync-native atomic-attrd acls
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: main:     
Maximum core file size is: 18446744073709551615
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
qb_ipcs_us_publish:       server name: pacemakerd
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_get_peer:     Created entry 
05ad8b08-25a3-4a2d-84cb-1fc355fb697c/0x55d844a446b0 for node 001store01a/1 (1 
total)
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_get_peer:     Node 1 is now known as 001store01a
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_get_peer:     Node 1 has uuid 1
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_update_peer_proc:     cluster_connect_cpg: Node 001store01a[1] - 
corosync-cpg is now online
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:  warning: 
cluster_connect_quorum:   Quorum lost
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_get_peer:     Created entry 
2f1f038e-9cc1-4a43-bab9-e7c91ca0bf3f/0x55d844a45ee0 for node 001store01b/2 (2 
total)
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_get_peer:     Node 2 is now known as 001store01b
May 27 11:49:48 [10681] 001store01a.ccnva.local pacemakerd:     info: 
crm_get_peer:     Node 2 has uuid 2

And yet, the messages log seems to show cluster activity throughout the whole 
period...

May 27 09:40:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 09:40:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 09:40:31 001store01a pengine[92170]:  notice: Calculated transition 
91467, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:40:31 001store01a crmd[92171]:  notice: Transition 91467 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:40:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 09:50:01 001store01a systemd: Started Session 202998 of user root.
May 27 09:50:01 001store01a systemd: Starting Session 202998 of user root.
May 27 09:51:08 001store01a python: 2021-05-27T09:51:08.080541Z INFO ExtHandler 
ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.2.54.2 is running as the goal state 
agent [DEBUG HeartbeatCounter: 2652;HeartbeatId: 
9010FCD6-7B1C-44D7-9DAD-DF324A613FD1;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpd
ate: 1]
May 27 09:55:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 09:55:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 09:55:31 001store01a pengine[92170]:  notice: Calculated transition 
91468, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:55:31 001store01a crmd[92171]:  notice: Transition 91468 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:55:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 09:56:37 001store01a systemd-logind: Removed session 202944.
May 27 10:00:01 001store01a systemd: Started Session 202999 of user root.
May 27 10:00:01 001store01a systemd: Starting Session 202999 of user root.
May 27 10:00:01 001store01a systemd: Started Session 203000 of user root.
May 27 10:00:01 001store01a systemd: Starting Session 203000 of user root.
May 27 10:01:01 001store01a systemd: Started Session 203001 of user root.
May 27 10:01:01 001store01a systemd: Starting Session 203001 of user root.
May 27 10:10:01 001store01a systemd: Started Session 203002 of user root.
May 27 10:10:01 001store01a systemd: Starting Session 203002 of user root.
May 27 10:10:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 10:10:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 10:10:31 001store01a pengine[92170]:  notice: Calculated transition 
91469, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 10:10:31 001store01a crmd[92171]:  notice: Transition 91469 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:10:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 10:10:48 001store01a python: 2021-05-27T10:10:48.294644Z INFO ExtHandler 
ExtHandler Checking for agent updates (family: Prod)
May 27 10:20:01 001store01a systemd: Started Session 203003 of user root.
May 27 10:20:01 001store01a systemd: Starting Session 203003 of user root.
May 27 10:21:12 001store01a python: 2021-05-27T10:21:12.024926Z INFO ExtHandler 
ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.2.54.2 is running as the goal state 
agent [DEBUG HeartbeatCounter: 2653;HeartbeatId: 
9010FCD6-7B1C-44D7-9DAD-DF324A613FD1;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpd
ate: 1]
May 27 10:25:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 10:25:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 10:25:31 001store01a pengine[92170]:  notice: Calculated transition 
91470, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 10:25:31 001store01a crmd[92171]:  notice: Transition 91470 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:25:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 10:29:08 001store01a systemd: Starting "Sophos Linux Security update"...
May 27 10:29:24 001store01a savd: update.check: Successfully updated Sophos 
Anti-Virus from sdds:SOPHOS
May 27 10:29:24 001store01a systemd: Started "Sophos Linux Security update".
May 27 10:30:01 001store01a systemd: Started Session 203004 of user root.
May 27 10:30:01 001store01a systemd: Starting Session 203004 of user root.
May 27 10:40:01 001store01a systemd: Started Session 203005 of user root.
May 27 10:40:01 001store01a systemd: Starting Session 203005 of user root.
May 27 10:40:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 10:40:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 10:40:31 001store01a pengine[92170]:  notice: Calculated transition 
91471, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 10:40:31 001store01a crmd[92171]:  notice: Transition 91471 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:40:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE
May 27 10:49:11 001store01a rsyslogd: imjournal: journal reloaded... [v8.24.0 
try http://www.rsyslog.com/e/0 ]
May 27 10:50:01 001store01a systemd: Started Session 203006 of user root.
May 27 10:50:01 001store01a systemd: Starting Session 203006 of user root.
May 27 10:51:16 001store01a python: 2021-05-27T10:51:16.375599Z INFO ExtHandler 
ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.2.54.2 is running as the goal state 
agent [DEBUG HeartbeatCounter: 2654;HeartbeatId: 
9010FCD6-7B1C-44D7-9DAD-DF324A613FD1;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpd
ate: 1]
May 27 10:55:31 001store01a crmd[92171]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE
May 27 10:55:31 001store01a pengine[92170]: warning: Processing failed op 
monitor for p_vip_ftpclust01 on 001store01a: unknown error (1)
May 27 10:55:31 001store01a pengine[92170]:  notice: Calculated transition 
91472, saving inputs in /var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 10:55:31 001store01a crmd[92171]:  notice: Transition 91472 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 10:55:31 001store01a crmd[92171]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE

> Did you find any core-dumps?

No.

>
> Regards,
> Klaus
> >
> > You can enable the debug logs by editing corosync.conf or
> > /etc/sysconfig/pacemaker.
> >
> > In case simple reload doesn't work, you can set the cluster in global
> > maintenance, stop and then start the stack.
> >
> >
> > Best Regards,
> > Strahil Nikolov
> >
> >     On Fri, May 28, 2021 at 22:13, Digimer
> >     <[email protected]> wrote:
> >     On 2021-05-28 3:08 p.m., Eric Robinson wrote:
> >     >
> >     >> -----Original Message-----
> >     >> From: Digimer <[email protected] <mailto:[email protected]>>
> >     >> Sent: Friday, May 28, 2021 12:43 PM
> >     >> To: Cluster Labs - All topics related to open-source clustering
> >     welcomed
> >     >> <[email protected] <mailto:[email protected]>>; Eric
> >     Robinson <[email protected]
> >     <mailto:[email protected]>>; Strahil
> >     >> Nikolov <[email protected]
> <mailto:[email protected]>>
> >     >> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
> >     >>
> >     >> Shared storage is not what triggers the need for fencing.
> >     Coordinating actions
> >     >> is what triggers the need. Specifically; If you can run
> >     resource on both/all
> >     >> nodes at the same time, you don't need HA. If you can't, you
> >     need fencing.
> >     >>
> >     >> Digimer
> >     >
> >     > Thanks. That said, there is no fencing, so any thoughts on why
> >     the node behaved the way it did?
> >
> >     Without fencing, when a communication or membership issues arises,
> >     it's
> >     hard to predict what will happen.
> >
> >     I don't see anything in the short log snippet to indicate what
> >     happened.
> >     What's on the other node during the event? When did the node
> disappear
> >     and when was it rejoined, to help find relevant log entries?
> >
> >     Going forward, if you want predictable and reliable operation,
> >     implement
> >     fencing asap. Fencing is required.
> >
> >
> >     --
> >     Digimer
> >     Papers and Projects: https://alteeve.com/w/ <https://alteeve.com/w/>
> >     "I am, somehow, less interested in the weight and convolutions of
> >     Einstein’s brain than in the near certainty that people of equal
> >     talent
> >     have lived and died in cotton fields and sweatshops." - Stephen
> >     Jay Gould
> >
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Cluster Stopped, No Messages?

Reply via email to