[ClusterLabs] Monitoring of Fence Agents?

Hayden,Robert Tue, 01 May 2018 13:50:01 -0700

Let me first thank this community for being so active.  I have learned a lot by 
watching the discussions.


I have noticed that in our environment we are seeing a high rate of monitor 
operation timeouts from our RHEL 7 fence agent (fence_ipmilan).  We use HP iLO 
3/4/5 power fencing.  I have tried to figure out why we are seeing the 
timeouts, but nothing appears to be misconfigured and there is no pattern 
to the fence agent failures.

The timeout then shows up in the pcs status output, and the fence agent 
resource, if it does not relocate or restart, moves to a Stopped state.  I have 
tried lengthening the monitor interval to 15 minutes and the start timeout to 
125 seconds, but we are still getting complaints from the System Admins.  They 
want a nice clean pcs status output and tend to freak out at the 
pcs stonith cleanup --node <node> command, as it shows a false cycling of all 
resources (vip, fs, app, etc.) for the node.

I am wondering if anyone has real-world experience with some of the timeout 
options provided by fence_ipmilan with HP iLO devices.  In particular, I was 
looking at pcmk_status_timeout and pcmk_status_retries from 
pcs stonith describe fence_ipmilan --full.
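
For concreteness, setting these two options on the devices listed in the config 
might look like the sketch below.  It is a dry run that only prints the 
pcs stonith update commands rather than executing them.  The values 60s and 3 
are placeholders, not recommendations, and whether these options influence the 
recurring monitor operation at all is not established here.

```shell
#!/bin/sh
# Dry-run sketch: prints the pcs commands instead of running them.
# The timeout and retry values are illustrative placeholders only.
STATUS_TIMEOUT="60s"
STATUS_RETRIES="3"

build_update_cmd() {
    # $1 = stonith resource id, e.g. fence_tval13
    printf 'pcs stonith update %s pcmk_status_timeout=%s pcmk_status_retries=%s\n' \
        "$1" "$STATUS_TIMEOUT" "$STATUS_RETRIES"
}

# Resource ids taken from the configuration in this message.
for dev in fence_tval13 fence_tval14; do
    build_update_cmd "$dev"
done
```

Printing the commands first makes it possible to review them before applying 
anything to a live cluster.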

How critical is the monitoring for the fence resources inside of pacemaker?   
Can I simply disable the monitoring operation?  We have an independent job that 
periodically verifies HP iLO setup for fencing (did this in RHEL 6).
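
A minimal sketch of what such an external check could look like is below.  It 
is hypothetical and not the actual job: the function name check_ilo and the 
FENCE_AGENT variable are made up for illustration, and it assumes the 
fence_ipmilan long options --ip, --username, --password, --lanplus, --privlvl 
and --action as shipped with RHEL 7 fence-agents.

```shell
#!/bin/sh
# Hypothetical external iLO check (illustration only, not the real job).
# FENCE_AGENT can be overridden, e.g. to substitute a stub when testing.
FENCE_AGENT="${FENCE_AGENT:-fence_ipmilan}"

check_ilo() {
    # $1 = iLO address, $2 = login, $3 = password
    # Caution: a password passed on the command line is visible in the
    # process list; a real job would want a safer way to supply it.
    # A non-zero exit is treated as a failure here; it may also just
    # mean the agent reported the host as powered off.
    if "$FENCE_AGENT" --ip="$1" --username="$2" --password="$3" \
        --lanplus --privlvl=operator --action=status >/dev/null 2>&1; then
        echo "OK $1"
    else
        echo "FAIL $1"
        return 1
    fi
}
```

A cron job could call check_ilo once per iLO address and alert on any FAIL line.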

From internal R&D testing, it appears that if the fence agent is "failed" or 
"stopped" and the cluster actually needs to fence a node, the cluster 
will re-attempt the fence agent start and fence the node.

Here is the config

Stonith Devices:
Resource: fence_tval13 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=X.X.X.X lanplus=1 login=XXXXX method=onoff passwd=XXXXX pcmk_host_list=tval13 power_wait=20 privlvl=OPERATOR
  Operations: monitor interval=15m (fence_tval13-monitor-interval-15m)
              start interval=0s timeout=125s (fence_tval13-start-interval-0s)
Resource: fence_tval14 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=Y.Y.Y.Y lanplus=1 login=YYYYY method=onoff passwd=YYYYY pcmk_host_list=tval14 power_wait=20 privlvl=OPERATOR
  Operations: monitor interval=15m (fence_tval14-monitor-interval-15m)
              start interval=0s timeout=125s (fence_tval14-start-interval-0s)


Thanks
Robert





