Hi,

first: is there a tutorial or something else which helps in understanding what pacemaker logs in syslog and /var/log/cluster/corosync.log? I am trying hard to find out what's going wrong, but the logs are difficult to understand, partly because of the sheer amount of information. Or should I work more with "crm history" or hb_report?

What happened:
I tried to configure a simple drbd resource following http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296

I used this simple snippet from the doc:

configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
    op monitor interval=60s

I did it on the live cluster, which is currently in testing. I will never do this again. Shadow will be my friend.

The cluster reacted promptly:

crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
   > op monitor interval=60
WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA

From what I understand so far, I didn't configure start/stop operations, so the cluster chooses the default from default-action-timeout. It didn't configure the monitor operation, because this is not in the meta-data.

I checked it:

crm(live)# ra info ocf:linbit:drbd
Manages a DRBD device as a Master/Slave resource (ocf:linbit:drbd)

Operations' defaults (advisory minimum):

    start         timeout=240
    promote       timeout=90
    demote        timeout=90
    notify        timeout=90
    stop          timeout=100
    monitor_Slave timeout=20 interval=20
    monitor_Master timeout=20 interval=10

OK. I have to configure monitor_Slave and monitor_Master.

The log says:

Aug 1 14:19:33 ha-idg-1 drbd(prim_drbd_idcc_devel)[11325]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6, cib-update=37, confirmed=true)
Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6, cib-update=38, confirmed=true)

Why is it complaining about a missing clone-max? This is a meta attribute for a clone, but not for a simple resource!?! This message is constantly repeated; it still appears although the cluster has been in standby for three days.

And why does it complain that stop is not configured? Isn't that configured with the default of 20 sec.? That's what crm said. See above. This message is also repeated nearly 7000 times in 9 minutes.

If the stop op is not configured and the cluster complains about it, why does it not complain about an unconfigured start op? That it complains about the missing monitor is clear.

The DC says:

Aug 1 14:19:33 ha-idg-2 pengine[27043]: warning: unpack_rsc_op_failure: Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured (6)
Aug 1 14:19:33 ha-idg-2 pengine[27043]: error: unpack_rsc_op: Preventing prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not configured' (6)

Again it complains about a failed stop, saying it's not configured. Or does it complain that the failure of a stop op is not configured?

The doc says: "Some operations are generated by the cluster itself, for example, stopping and starting resources as needed."
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
Is the doc wrong? What happens when I DON'T configure start/stop operations? Are they configured automatically? I have several primitives without a configured start/stop operation, but never had any problems with them.
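
To get a handle on how often each of these "not configured" messages repeats, something like the following could tally the crmd lines per node, operation and return code. This is only a minimal sketch: the regular expression is my own and assumes the syslog line format quoted in this mail, and the names LRM_EVENT, tally_lrm_events and SAMPLE are made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the crmd lines quoted in this mail:
#   ... process_lrm_event: Operation <op>: <status> (node=<node>, call=N, rc=N, ...)
LRM_EVENT = re.compile(
    r"process_lrm_event: Operation (?P<op>\S+): (?P<status>.+?) "
    r"\(node=(?P<node>[^,]+), call=\d+, rc=(?P<rc>\d+)"
)


def tally_lrm_events(lines):
    """Count process_lrm_event messages per (node, operation, rc)."""
    counts = Counter()
    for line in lines:
        match = LRM_EVENT.search(line)
        if match:
            key = (match.group("node"), match.group("op"), int(match.group("rc")))
            counts[key] += 1
    return counts


# A few lines copied from the logs in this mail; the pengine line is
# included to show that non-matching lines are simply skipped.
SAMPLE = [
    "Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6, cib-update=37, confirmed=true)",
    "Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6, cib-update=38, confirmed=true)",
    "Aug 1 14:19:33 ha-idg-2 pengine[27043]: warning: unpack_rsc_op_failure: Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured (6)",
    "Aug 1 14:28:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_60000: not configured (node=ha-idg-1, call=6968, rc=6, cib-update=6932, confirmed=false)",
]

if __name__ == "__main__":
    # Print the most frequent (node, operation, rc) combinations first.
    for (node, op, rc), n in tally_lrm_events(SAMPLE).most_common():
        print(f"{n:6d}  {node}  {op}  rc={rc}")
```

Since the function only iterates over lines, an open syslog file can be passed instead of SAMPLE; which file that is depends on the syslog setup of the node.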

The fail count goes straight to INFINITY:

Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY

After exactly 9 minutes the complaints about the unconfigured stop operation stopped. The complaints about the missing clone-max still appear, although both nodes are in standby now.

The fail-count is 1 million:

Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent update 7076: fail-count-prim_drbd_idcc_devel=1000000

and a complaint about the monitor operation appeared again:

Aug 1 14:28:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_60000: not configured (node=ha-idg-1, call=6968, rc=6, cib-update=6932, confirmed=false)
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_cs_dispatch: Update relayed from ha-idg-2

crm_mon said:

Failed actions:
    prim_drbd_idcc_devel_stop_0 on ha-idg-1 'not configured' (6): call=6967, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
    prim_drbd_idcc_devel_monitor_60000 on ha-idg-1 'not configured' (6): call=6968, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
    prim_drbd_idcc_devel_stop_0 on ha-idg-2 'not configured' (6): call=6963, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=40ms

A big problem was that I have a ClusterMon resource running on each node. It triggered about 20000 SNMP traps in 193 seconds to my management station, which triggered 20000 e-mails ...
Where does this incredible number of traps come from? Nearly all traps said that stop is not configured for the drbd resource.
Why does it complain so often? And why does it stop after ~20,000 traps? And why does it complain about the unconfigured monitor operation just 8 times?

Btw: is there a history like in bash where I can see which crm command I entered at which time? I know that crm history is powerful, but I didn't find that.

Bernd

--
Bernd Lentes
System administration
Institute of Developmental Genetics
HelmholtzZentrum München
