Hi! That's odd indeed. We have a server that indicates network traffic via an LED on the front, and since we use a dedicated cluster NIC, we see constant flickering of that LED. Also, the nodes detect node failures via corosync almost immediately. In your case it seems a cluster recheck is needed to detect a node failure; I don't know why. Sorry! Do you have any special configuration parameters in corosync? Maybe show the config.
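For comparison, failure detection speed is governed by the totem timing options in /etc/corosync/corosync.conf. Just a sketch (option names per corosync.conf(5); the values below are examples, not your actual config):

totem {
        version: 2
        # how long (ms) to wait for the token before declaring it lost
        token: 10000
        # retransmits before the token is finally considered lost
        token_retransmits_before_loss_const: 10
        # how long (ms) to wait for consensus before forming a new
        # membership; defaults to 1.2 * token when unset
        consensus: 12000
}

With token: 10000 the membership layer should normally declare a dead peer lost well before 30 seconds, so a 30-second gap smells more like something above corosync.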
Regards,
Ulrich

>>> ashutosh tiwari <[email protected]> wrote on 10.01.2018 at 11:47 in message
<ca+vegjhq6yerz1fo4sn312e6hxc-agmeabricuewtbwvqvm...@mail.gmail.com>:
> Hi Ulrich,
>
> Thanks for the response.
>
> 30 sec is the time for detection only, as confirmed by the logs.
>
> ++++++++++++++++++++++++++++++++++++
> Jan 10 11:06:18 [19261] orana crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
> Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jan 10 11:06:18 [19260] orana pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
> Jan 10 11:06:18 [19260] orana pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node tigana is active
> Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node tigana is online
> Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status_fencing: Node orana is active
> Jan 10 11:06:18 [19260] orana pengine: info: determine_online_status: Node orana is online
> Jan 10 11:06:18 [19260] orana pengine: info: clone_print: Master/Slave Set: unicloud-master [unicloud]
> Jan 10 11:06:18 [19260] orana pengine: info: short_print: Masters: [ tigana ]
> Jan 10 11:06:18 [19260] orana pengine: info: short_print: Slaves: [ orana ]
> Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-orana (stonith:fence_ilo4): Started tigana
> Jan 10 11:06:18 [19260] orana pengine: info: native_print: fence-uc-tigana (stonith:fence_ilo4): Started tigana
> Jan 10 11:06:18 [19260] orana pengine: info: master_color: Promoting unicloud:0 (Master tigana)
> Jan 10 11:06:18 [19260] orana pengine: info: master_color: unicloud-master: Promoted 1 instances of a possible 1 to master
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:0 (Master tigana)
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave unicloud:1 (Slave orana)
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-orana (Started tigana)
> Jan 10 11:06:18 [19260] orana pengine: info: LogActions: Leave fence-uc-tigana (Started tigana)
> Jan 10 11:06:18 [19260] orana pengine: notice: process_pe_message: Calculated Transition 2390: /var/lib/pacemaker/pengine/pe-input-1655.bz2
> Jan 10 11:06:18 [19261] orana crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jan 10 11:06:18 [19261] orana crmd: info: do_te_invoke: Processing graph 2390 (ref=pe_calc-dc-1515562578-2650) derived from /var/lib/pacemaker/pengine/pe-input-1655.bz2
> Jan 10 11:06:18 [19261] orana crmd: notice: run_graph: Transition 2390 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1655.bz2): Complete
> Jan 10 11:06:18 [19261] orana crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Jan 10 11:06:18 [19261] orana crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 10 11:06:31 corosync [TOTEM ] A processor failed, forming new configuration.
> Jan 10 11:06:33 corosync [QUORUM] Members[1]: 1
> Jan 10 11:06:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 10 11:06:33 [19250] orana pacemakerd: info: cman_event_callback: Membership 2064: quorum retained
> Jan 10 11:06:33 [19261] orana crmd: info: cman_event_callback: Membership 2064: quorum retained
> Jan 10 11:06:33 [19250] orana pacemakerd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: info: peer_update_callback: tigana is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: warning: match_down_event: No match for shutdown action on tigana
> Jan 10 11:06:33 [19261] orana crmd: notice: peer_update_callback: Stonith/shutdown of tigana not matched
> Jan 10 11:06:33 [19261] orana crmd: info: crm_update_peer_join: peer_update_callback: Node tigana[2] - join-2 phase 4 -> 0
> Jan 10 11:06:33 [19261] orana crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
> Jan 10 11:06:33 corosync [CPG ] chosen downlist: sender r(0) ip(7.7.7.1) ; members(old:2 left:1)
> ++++++++++++++++++++++++++
>
> These are the logs from the standby node (new active).
> The kernel panic was triggered at 11:06:00 on the other node, and the totem
> membership change is reported here at 11:06:31.
>
> 30 secs is the cluster recheck timer.
>
> Regards,
> Ashutosh
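A side note on the 30 seconds: the "PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)" comes from the Pacemaker property cluster-recheck-interval, not from corosync. A sketch of how to inspect and change it, assuming the pcs 0.9 CLI shipped with CentOS 6 (crm_attribute would work as well):

# show the effective value (the default is 15min when unset)
pcs property list --all | grep cluster-recheck-interval
# example: set it explicitly
pcs property set cluster-recheck-interval=30s

That timer only re-runs the policy engine periodically; membership loss itself should still come from the totem token, so the interesting question is why "[TOTEM ] A processor failed" appears only at 11:06:31, 31 seconds after the panic.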
>
> On Wed, Jan 10, 2018 at 3:12 PM, <[email protected]> wrote:
>
>> ----------------------------------------------------------------------
>>
>> Today's Topics:
>>
>>    1. corosync taking almost 30 secs to detect node failure in case
>>       of kernel panic (ashutosh tiwari)
>>    2. Antw: corosync taking almost 30 secs to detect node failure
>>       in case of kernel panic (Ulrich Windl)
>>    3. pacemaker reports monitor timeout while CPU is high (???)
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 10 Jan 2018 12:43:46 +0530
>> From: ashutosh tiwari <[email protected]>
>> To: [email protected]
>> Subject: [ClusterLabs] corosync taking almost 30 secs to detect node
>>         failure in case of kernel panic
>> Message-ID:
>>         <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muVb...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi,
>>
>> We have a two node cluster running in active/standby mode with IPMI
>> fencing configured.
>>
>> In case of a kernel panic on the active node, the standby node detects the
>> node failure in around 30 secs, which delays the standby node taking over
>> the active role.
>>
>> We have the totem token timeout set to 10000 msecs.
>> Please let us know in case there is any more configuration controlling
>> membership detection.
>>
>> s/w versions:
>>
>> centos 6.7
>> corosync-1.4.7-5.el6.x86_64
>> pacemaker-1.1.14-8.el6.x86_64
>>
>> Thanks and Regards,
>> Ashutosh Tiwari
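(Side note on "any more configuration controlling membership detection": with corosync 1.4 you can dump the values the daemon actually runs with from its live object database and compare them against corosync.conf. A sketch only; corosync-objctl is the corosync 1.x tool, and the grep patterns are just examples:

corosync-objctl | grep -i token
corosync-objctl | grep -i consensus

That would confirm whether the 10000 ms token is really in effect.)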
>> ------------------------------
>>
>> Message: 2
>> Date: Wed, 10 Jan 2018 08:32:16 +0100
>> From: "Ulrich Windl" <[email protected]>
>> To: <[email protected]>
>> Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect
>>         node failure in case of kernel panic
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset=US-ASCII
>>
>> Hi!
>>
>> Maybe define "detecting node failure". Could it be that your 30 seconds are
>> between detection and reaction? Logs would help here, too.
>>
>> Regards,
>> Ulrich
>>
>> >>> ashutosh tiwari <[email protected]> wrote on 10.01.2018 at
>> 08:13 in message
>> <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muv...@mail.gmail.com>:
>> > Hi,
>> >
>> > We have a two node cluster running in active/standby mode with IPMI
>> > fencing configured.
>> >
>> > In case of a kernel panic on the active node, the standby node detects the
>> > node failure in around 30 secs, which delays the standby node taking over
>> > the active role.
>> >
>> > We have the totem token timeout set to 10000 msecs.
>> > Please let us know in case there is any more configuration controlling
>> > membership detection.
>> >
>> > s/w versions:
>> >
>> > centos 6.7
>> > corosync-1.4.7-5.el6.x86_64
>> > pacemaker-1.1.14-8.el6.x86_64
>> >
>> > Thanks and Regards,
>> > Ashutosh Tiwari
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Wed, 10 Jan 2018 09:40:51 +0000
>> From: ??? <[email protected]>
>> To: Cluster Labs - All topics related to open-source clustering
>>         welcomed <[email protected]>
>> Subject: [ClusterLabs] pacemaker reports monitor timeout while CPU is
>>         high
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello,
>>
>> This issue only appears when we run a performance test and the CPU load is
>> high. The cluster configuration and log are below. Pacemaker restarts the
>> Slave-side pgsql-ha resource about every two minutes.
>>
>> Take the following scenario for example: when the pgsqlms RA is called, we
>> print the log "execute the command start (command)". When the command
>> returns, we print the log "execute the command stop (Command) (result)".
>>
>> 1. We can see that Pacemaker calls "pgsqlms monitor" about every 15
>> seconds, and it returns $OCF_SUCCESS.
>>
>> 2. It calls the monitor command again at 13:56:16, and then reports a
>> timeout error at 13:56:18. Only 2 seconds have passed, but it reports
>> "timeout=10000ms".
>>
>> 3. In other logs, sometimes after 15 minutes there is no "execute the
>> command start monitor" printed, and it reports the timeout error directly.
>>
>> Could you please tell us how to debug or resolve such an issue?
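(Inline remark on the above: the "High CPU load detected" notice comes from crmd's built-in throttling, which reduces how many resource actions the local executor runs in parallel once system load passes a threshold; under heavy throttling a monitor can sit queued long enough to overrun its timeout even though the RA itself is fast. If I remember right the knob is the cluster option load-threshold (default 80%); a sketch only, assuming pcs is in use and that raising it is safe to test on this cluster:

pcs property set load-threshold=95%

Treat that as an experiment, not a fix; the real cure is usually more CPU headroom or larger monitor timeouts.)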
>> The log:
>>
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
>> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999
>> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>>
>> The cluster configuration:
>>
>> 2 nodes and 13 resources configured
>>
>> Online: [ db1 db2 ]
>>
>> Full list of resources:
>>
>>  Clone Set: dlm-clone [dlm]
>>      Started: [ db1 db2 ]
>>  Clone Set: clvmd-clone [clvmd]
>>      Started: [ db1 db2 ]
>>  ipmi_node1 (stonith:fence_ipmilan): Started db2
>>  ipmi_node2 (stonith:fence_ipmilan): Started db1
>>  Clone Set: clusterfs-clone [clusterfs]
>>      Started: [ db1 db2 ]
>>  Master/Slave Set: pgsql-ha [pgsqld]
>>      Masters: [ db1 ]
>>      Slaves: [ db2 ]
>>  Resource Group: mastergroup
>>      db1-vip (ocf::heartbeat:IPaddr2): Started
>>      rep-vip (ocf::heartbeat:IPaddr2): Started
>>  Resource Group: slavegroup
>>      db2-vip (ocf::heartbeat:IPaddr2): Started
>>
>> pcs resource show pgsql-ha
>>  Master: pgsql-ha
>>   Meta Attrs: interleave=true notify=true
>>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>>                promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>>                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>>                monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
>>                monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
>>                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
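(Continuing the remark: with monitors that normally finish in under a second but a reported load above 400, raising the 10s monitor timeouts is a cheap test; a sketch, and pcs operation syntax varies a bit between versions:

pcs resource op remove pgsqld monitor interval=16s
pcs resource op add pgsqld monitor interval=16s role=Slave timeout=60s

The role=Master monitor (interval=15s) could be adjusted the same way.)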
>>
>> ------------------------------
>>
>> _______________________________________________
>> Users mailing list
>> [email protected]
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> End of Users Digest, Vol 36, Issue 8
>> ************************************

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
