On 01/04/2017 02:06 PM, Alfonso Ali wrote:
> Hi Ulrich,
>
> You're right, it is as if stonithd selected the incorrect device to
> reboot the node. I'm using fence_ilo as the stonith agent, and
> reviewing the params it takes, it is not clear which one (besides the
> name, which is irrelevant for stonithd) should be used to fence each
> node.
>
> In cman+rgmanager you can associate fence device params with each
> node, for example:
>
> <clusternode name="e1b07" nodeid="2">
>   <fence>
>     <method name="single">
>       <device name="fence_ilo" ipaddr="e1b07-ilo"/>
>     </method>
>   </fence>
> </clusternode>
>
> What's the equivalent of that in corosync+pacemaker (using crm)?
>
> In general, in a cluster of more than 2 nodes and more than 2 stonith
> devices, how does stonithd find which stonith device should be used to
> fence a specific node?
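As a sketch only (reusing the device names and placeholder credentials from the
cluster conf quoted later in this thread, not a verified setup), the usual crm
equivalent is to tell stonithd explicitly which node each device can fence via
the pcmk_host_list parameter on the stonith primitive:

    primitive fence-e1b07 stonith:fence_ilo \
            params ipaddr=e1b07-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
                    pcmk_host_list=e1b07 \
            op monitor interval=300 timeout=120

When the name the fence agent expects differs from the cluster node name,
pcmk_host_map can be used instead, e.g. pcmk_host_map="e1b07:e1b07-ilo".
Without either, stonithd falls back to asking the agent itself which hosts it
can fence (pcmk_host_check=dynamic-list), which depends on what the agent
reports.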
You have the attributes pcmk_host_list & pcmk_host_map to control that.

> Regards,
> Ali
>
> On Wed, Jan 4, 2017 at 2:27 AM, Ulrich Windl
> <[email protected]> wrote:
>
> Hi!
>
> A few messages that look uncommon to me are:
>
> crm_reap_dead_member: Removing node with name unknown and id
> 1239211543 from membership cache
>
> A bit later the node name is known:
> info: crm_update_peer_proc: pcmk_cpg_membership: Node
> e1b13[1239211543] - corosync-cpg is now offline
>
> Another node seems to go offline also:
> crmd: info: peer_update_callback: Client e1b13/peer now has
> status [offline] (DC=e1b07, changed=4000000)
>
> This looks OK to me:
> stonith-ng: debug: get_capable_devices: Searching through
> 3 devices to see what is capable of action (reboot) for target e1b13
> stonith-ng: debug: stonith_action_create: Initiating action
> status for agent fence_ilo (target=e1b13)
>
> This looks odd to me:
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b03 now running with pid=25689, timeout=20s
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b07 now running with pid=25690, timeout=20s
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b13 now running with pid=25691, timeout=20s
>
> Maybe not, because it seems you can fence the node in three different
> ways:
> stonith-ng: debug: stonith_query_capable_device_cb: Found 3
> matching devices for 'e1b13'
>
> Now it's getting odd:
> stonith-ng: debug: schedule_stonith_command: Scheduling reboot
> on fence-e1b07 for remote peer e1b07 with op id
> (ae1956b5-ffe1-4d6a-b5a2-c7bba2c6d7fd) (timeout=60s)
> stonith-ng: debug: stonith_action_create: Initiating action
> reboot for agent fence_ilo (target=e1b13)
> stonith-ng: debug: stonith_device_execute: Operation reboot
> for node e1b13 on fence-e1b07 now running with pid=25784, timeout=60s
> crmd: info: crm_update_peer_expected: handle_request:
> Node e1b07[1239211582] - expected state is now down (was member)
> stonith-ng: debug: st_child_done: Operation 'reboot' on
> 'fence-e1b07' completed with rc=0 (2 remaining)
> stonith-ng: notice: log_operation: Operation 'reboot' [25784]
> (call 6 from crmd.1201) for host 'e1b13' with device 'fence-e1b07'
> returned: 0 (OK)
> attrd: info: crm_update_peer_proc: pcmk_cpg_membership: Node
> e1b07[1239211582] - corosync-cpg is now offline
>
> To me it looks as if your STONITH agents kill the wrong node for
> reasons unknown to me.
>
> (I didn't inspect the whole logs)
>
> Regards,
> Ulrich
>
> >>> Alfonso Ali <[email protected]> schrieb am 03.01.2017 um 18:54 in
> Nachricht
> <CANeoTMee-=_-Gtf_vxigKsrXNQ0pWEUAg=7yjhrhvrwdnth...@mail.gmail.com>:
> > Hi Ulrich,
> >
> > I'm using udpu and a static node list. This is my corosync conf:
> >
> > --------------------Corosync configuration--------------------
> > totem {
> >     version: 2
> >     cluster_name: test-cluster
> >     token: 3000
> >     token_retransmits_before_loss_const: 10
> >     clear_node_high_bit: yes
> >     crypto_cipher: aes256
> >     crypto_hash: sha1
> >     transport: udpu
> >
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: 201.220.222.0
> >         mcastport: 5405
> >         ttl: 1
> >     }
> > }
> >
> > logging {
> >     fileline: off
> >     to_stderr: no
> >     to_logfile: no
> >     to_syslog: yes
> >     syslog_facility: daemon
> >     debug: on
> >     timestamp: on
> >     logger_subsys {
> >         subsys: QUORUM
> >         debug: on
> >     }
> > }
> >
> > quorum {
> >     provider: corosync_votequorum
> >     expected_votes: 3
> > }
> >
> > nodelist {
> >     node: {
> >         ring0_addr: 201.220.222.62
> >     }
> >     node: {
> >         ring0_addr: 201.220.222.23
> >     }
> >     node: {
> >         ring0_addr: 201.220.222.61
> >     }
> >     node: {
> >         ring0_addr: 201.220.222.22
> >     }
> > }
> > --------------------/Corosync conf--------------------
> >
> > The pacemaker log is very long, I'm sending it attached as a zip
> > file; I don't know if the list will allow it. If not, please tell me
> > which sections (stonith, crmd, lrmd, attrd, cib) I should post.
> >
> > For a better understanding: the cluster has 4 nodes, e1b03, e1b07,
> > e1b12 and e1b13. I simulated a crash on e1b13 with:
> >
> > echo c > /proc/sysrq-trigger
> >
> > The cluster detected e1b13 as crashed and rebooted it, but after
> > that e1b07 was restarted too, and later e1b03 did the same; the only
> > node that remained alive was e1b12. The attached log was taken from
> > that node.
> >
> > Let me know if any other info is needed to debug the problem.
> >
> > Regards,
> > Ali
> >
> > On Mon, Jan 2, 2017 at 3:30 AM, Ulrich Windl
> > <[email protected]> wrote:
> >
> >> Hi!
> >>
> >> Seeing the detailed log of events would be helpful. Despite that,
> >> we had a similar issue using multicast (after adding a new node to
> >> an existing cluster). Switching to UDPU helped in our case, but
> >> unless we see the details, it's all just guessing...
> >>
> >> Ulrich
> >> P.S. A good new year to everyone!
> >>
> >> >>> Alfonso Ali <[email protected]> schrieb am 30.12.2016 um 21:40 in
> >> Nachricht
> >> <CANeoTMcuNGw_T9e4WNEEK-nmHnV-NwiX2Ck0UBDnVeuoiC=r...@mail.gmail.com>:
> >> > Hi,
> >> >
> >> > I have a four-node cluster that uses iLO as the fencing agent.
> >> > When I simulate a node crash (either by killing corosync or with
> >> > echo c > /proc/sysrq-trigger) the node is marked as UNCLEAN and
> >> > requested to be restarted by the stonith agent, but every time
> >> > that happens another node in the cluster is also marked as
> >> > UNCLEAN and rebooted as well. After the nodes are rebooted they
> >> > are marked as online again and the cluster resumes operation
> >> > without problem.
> >> >
> >> > I have reviewed the corosync and pacemaker logs but found nothing
> >> > that explains why the other node is also rebooted.
> >> >
> >> > Any hint of what to check or what to look for would be
> >> > appreciated.
> >> >
> >> > -----------------Cluster conf----------------------------------
> >> > node 1239211542: e1b12 \
> >> >         attributes standby=off
> >> > node 1239211543: e1b13
> >> > node 1239211581: e1b03 \
> >> >         attributes standby=off
> >> > node 1239211582: e1b07 \
> >> >         attributes standby=off
> >> > primitive fence-e1b03 stonith:fence_ilo \
> >> >         params ipaddr=e1b03-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b07 stonith:fence_ilo \
> >> >         params ipaddr=e1b07-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b12 stonith:fence_ilo \
> >> >         params ipaddr=e1b12-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b13 stonith:fence_ilo \
> >> >         params ipaddr=e1b13-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > ..... extra resources ......
> >> > location l-f-e1b03 fence-e1b03 \
> >> >         rule -inf: #uname eq e1b03 \
> >> >         rule 10000: #uname eq e1b07
> >> > location l-f-e1b07 fence-e1b07 \
> >> >         rule -inf: #uname eq e1b07 \
> >> >         rule 10000: #uname eq e1b03
> >> > location l-f-e1b12 fence-e1b12 \
> >> >         rule -inf: #uname eq e1b12 \
> >> >         rule 10000: #uname eq e1b13
> >> > location l-f-e1b13 fence-e1b13 \
> >> >         rule -inf: #uname eq e1b13 \
> >> >         rule 10000: #uname eq e1b12
> >> > property cib-bootstrap-options: \
> >> >         have-watchdog=false \
> >> >         dc-version=1.1.15-e174ec8 \
> >> >         cluster-infrastructure=corosync \
> >> >         stonith-enabled=true \
> >> >         cluster-name=test-cluster \
> >> >         no-quorum-policy=freeze \
> >> >         last-lrm-refresh=1483125286
> >> > ----------------------------------------------------------------
> >> >
> >> > Regards,
> >> > Ali
>
> _______________________________________________
> Users mailing list: [email protected]
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
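Given the odd log lines quoted earlier (a 'reboot' scheduled on fence-e1b07
while the target was e1b13), one way to see the mapping stonithd is actually
using is pacemaker's stonith_admin tool. As a sketch, assuming it is installed
on a cluster node (the node name e1b13 is taken from the config above):

    # list the fencing devices currently registered with stonithd
    stonith_admin --list-registered

    # ask which of those devices stonithd considers able to fence e1b13
    stonith_admin --list e1b13

If the second command lists devices for other nodes (e.g. fence-e1b07) as
capable of fencing e1b13, that would be consistent with the wrong node being
rebooted, and would point at missing pcmk_host_list/pcmk_host_map settings on
the stonith primitives.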
