[ClusterLabs] big trouble with a DRBD resource

Lentes, Bernd Fri, 04 Aug 2017 09:23:37 -0700

Hi,

first: is there a tutorial or something else that helps in understanding what 
pacemaker logs in syslog and /var/log/cluster/corosync.log?
I am trying hard to find out what's going wrong, but the logs are difficult to 
understand, also because of the amount of information.
Or should I rely more on "crm history" or hb_report?


What happened:
I tried to configure a simple drbd resource following 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
I used this simple snip from the doc:
configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
    op monitor interval=60s

I did it on a live cluster, which is currently in testing. I will never do this 
again. Shadow will be my friend.

The cluster reacted promptly:
crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
   > op monitor interval=60
WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA

As far as I understand so far, I didn't configure start/stop operations, so 
the cluster takes the default from default-action-timeout.
It didn't configure the monitor operation, because this is not in the meta-data.

I checked it:
crm(live)# ra info ocf:linbit:drbd
Manages a DRBD device as a Master/Slave resource (ocf:linbit:drbd)

Operations' defaults (advisory minimum):

    start         timeout=240
    promote       timeout=90
    demote        timeout=90
    notify        timeout=90
    stop          timeout=100
    monitor_Slave timeout=20 interval=20
    monitor_Master timeout=20 interval=10

OK. I have to configure monitor_Slave and monitor_Master.

The log says:
Aug  1 14:19:33 ha-idg-1 drbd(prim_drbd_idcc_devel)[11325]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
                                                                                                          ^^^^^^^^^
Aug  1 14:19:33 ha-idg-1 crmd[4692]:   notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6, cib-update=37, confirmed=true)
Aug  1 14:19:33 ha-idg-1 crmd[4692]:   notice: process_lrm_event: Operation prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6, cib-update=38, confirmed=true)

Why is it complaining about a missing clone-max? This is a meta attribute for a 
clone, not for a simple resource!? This message is repeated constantly; it 
still appears although the cluster has been in standby for three days.
And why does it complain that stop is not configured?
Isn't that configured with the default of 20 sec.? That's what crm said, see 
above. This message was also repeated nearly 7000 times in 9 minutes.
If the stop op is not configured and the cluster complains about it, why does 
it not complain about an unconfigured start op?
That it complains about the missing monitor is clear.
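
One possible explanation for the clone-max message, as an assumption on my 
side and not a confirmed diagnosis: ra info describes the agent as managing a 
Master/Slave resource, so it may expect to be wrapped in an ms resource with 
clone-max set, and to have one monitor operation per role. A minimal sketch of 
such a configuration follows. The timeouts and intervals are the advisory 
values from ra info above; the ms name ms_drbd_idcc_devel, the remaining meta 
attribute values and the file name are hypothetical. The script only writes a 
file for review and changes nothing in the cluster.

```shell
# Write a candidate configuration to a file instead of typing it into the
# live cluster. File name and ms resource name are made up for illustration.
cfg="${CFG:-drbd_idcc_devel.crm}"

cat > "$cfg" <<'EOF'
primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 \
    op stop timeout=100 \
    op promote timeout=90 \
    op demote timeout=90 \
    op monitor interval=10 timeout=20 role=Master \
    op monitor interval=20 timeout=20 role=Slave
ms ms_drbd_idcc_devel prim_drbd_idcc_devel \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
EOF

# After review (preferably against a shadow CIB) the file could be loaded with:
#   crm configure load update "$cfg"
echo "wrote $cfg"
```

Whether this actually silences the clone-max error is untested here; it only 
mirrors the structure the agent's meta-data seems to ask for.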

The DC says:
Aug  1 14:19:33 ha-idg-2 pengine[27043]:  warning: unpack_rsc_op_failure: Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured (6)
Aug  1 14:19:33 ha-idg-2 pengine[27043]:    error: unpack_rsc_op: Preventing prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not configured' (6)

Again it complains about a failed stop, saying it's not configured. Or does it 
complain that the failure of a stop op is not configured?
The doc says:
"Some operations are generated by the cluster itself, for example, stopping and 
starting resources as needed."
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
Is the doc wrong?
What happens when I DON'T configure start/stop operations? Are they configured 
automatically?
I have several primitives without a configured start/stop operation, but never 
had any problems with them.

The fail-count goes directly to INFINITY:
Aug  1 14:19:33 ha-idg-1 attrd[4690]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
Aug  1 14:19:33 ha-idg-1 attrd[4690]:   notice: attrd_perform_update: Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY


After exactly 9 minutes the complaints about the unconfigured stop operation 
stopped; the complaints about the missing clone-max still appear, although both 
nodes are in standby.

Now the fail-count is 1 million:
Aug  1 14:28:33 ha-idg-1 attrd[4690]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
Aug  1 14:28:33 ha-idg-1 attrd[4690]:   notice: attrd_perform_update: Sent update 7076: fail-count-prim_drbd_idcc_devel=1000000

and a complaint about the monitor operation appeared again:
Aug  1 14:28:33 ha-idg-1 crmd[4692]:   notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_60000: not configured (node=ha-idg-1, call=6968, rc=6, cib-update=6932, confirmed=false)
Aug  1 14:28:33 ha-idg-1 attrd[4690]:   notice: attrd_cs_dispatch: Update relayed from ha-idg-2

crm_mon said:
Failed actions:
    prim_drbd_idcc_devel_stop_0 on ha-idg-1 'not configured' (6): call=6967, status=complete, exit-reason='none', last-rc-change='Tue Aug  1 14:28:33 2017', queued=0ms, exec=41ms
    prim_drbd_idcc_devel_monitor_60000 on ha-idg-1 'not configured' (6): call=6968, status=complete, exit-reason='none', last-rc-change='Tue Aug  1 14:28:33 2017', queued=0ms, exec=41ms
    prim_drbd_idcc_devel_stop_0 on ha-idg-2 'not configured' (6): call=6963, status=complete, exit-reason='none', last-rc-change='Tue Aug  1 14:28:33 2017', queued=0ms, exec=40ms

A big problem was that I have a ClusterMon resource running on each node. It 
triggered about 20000 SNMP traps in 193 seconds to my management station, which 
triggered 20000 e-mails ...
Where does this incredible number of traps come from? Nearly all traps said that 
stop is not configured for the drbd resource. Why does it complain so often? And 
why does it stop after ~20,000 traps?
The unconfigured monitor operation, by contrast, was reported just 8 times.
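
To compare these trap counts with what the nodes themselves logged, the 
"not configured" results can be counted per operation. The following is only a 
sketch: it assumes the log entries are single lines in the process_lrm_event 
format quoted above (not wrapped as in this mail), and the function name 
count_not_configured is made up.

```shell
# Count "not configured" results per operation name in a syslog-style file.
# $1: path to the log file to scan.
count_not_configured() {
    grep 'process_lrm_event: Operation ' "$1" \
        | grep 'not configured' \
        | sed -e 's/.*Operation \([^:]*\): not configured.*/\1/' \
        | sort | uniq -c | sort -rn
}
```

It would be called as count_not_configured followed by the path of the syslog 
file, and prints one line per operation with the number of occurrences in 
front, most frequent first.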

Btw: is there a history like in bash where I can see which crm command I 
entered at which time? I know that crm history is powerful, but I didn't find 
that there.




Bernd

-- 
Bernd Lentes 

Systemadministration 
institute of developmental genetics 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum München 
[email protected] 
phone: +49 (0)89 3187 1241 
fax: +49 (0)89 3187 2294 

no backup - no mercy
 


