Let me first thank this community for being so active. I have learned a lot by
watching the discussions.
I have noticed that in our environment I am seeing a high rate of timeouts on
monitor operations for my RHEL 7 fence agent (fence_ipmilan). We use HP iLO
3/4/5 power fencing. I have tried to work out why we are seeing the
timeouts, but nothing appears to be misconfigured and there is no pattern
to the fence agent failures.
The timeout then shows up in the pcs status output, and the fence agent
resource, if it does not relocate or restart, moves to a Stopped state. I have
tried lengthening the monitor interval to 15 minutes and the start timeout to
125 seconds, but we are still getting complaints from the System Admins. They
want a nice clean pcs status output and tend to freak out at the
pcs stonith cleanup --node <node> command, as it shows a false cycling of all
resources (vip, fs, app, etc.) for the node.
I am wondering whether anyone has real-world experience with the timeout
options provided by fence_ipmilan with HP iLO devices. In particular, I was
looking at pcmk_status_timeout and pcmk_status_retries from
pcs stonith describe fence_ipmilan --full.
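To make the question concrete, here is a sketch of how those two options might be applied to the two devices in the config further down. The values 60s and 3 are placeholders that I have not validated, and the run wrapper with its DRY_RUN variable is a hypothetical helper that only prints each command unless DRY_RUN=0 is set. The same describe output also lists pcmk_monitor_timeout and pcmk_monitor_retries, which may turn out to be the options that govern the recurring monitor operation; I have not confirmed that either.

```shell
#!/bin/sh
# Sketch only: the timeout and retry values below are untested placeholders.

# run: hypothetical helper. Prints the command by default (DRY_RUN=1);
# executes it only when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply the same instance attributes to both fence devices.
for dev in fence_tval13 fence_tval14; do
    run pcs stonith update "$dev" pcmk_status_timeout=60s pcmk_status_retries=3
done
```

Running the script as-is prints the two pcs commands for review; running it with DRY_RUN=0 would apply them to the cluster.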
How critical is the monitoring of the fence resources within Pacemaker?
Can I simply disable the monitor operation? We have an independent job that
periodically verifies the HP iLO setup for fencing (we did the same on RHEL 6).
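A check of this kind, run from cron outside Pacemaker, could look roughly like the following sketch. It assumes ipmitool is installed and that the iLO answers IPMI over lanplus at OPERATOR privilege, as in the fence device attributes. ILO_HOSTS, ILO_USER, ILO_PASS_FILE and RETRY_DELAY are hypothetical names chosen for illustration, and the retry count of 3 is arbitrary.

```shell
#!/bin/sh
# Sketch only: out-of-band iLO reachability check with simple retries.
# ILO_HOSTS (space-separated list), ILO_USER, ILO_PASS_FILE and RETRY_DELAY
# are placeholder variables, not part of any existing tool.

# retry N CMD...: run CMD up to N times, sleeping between failed attempts.
retry() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        [ "$i" -lt "$attempts" ] && sleep "${RETRY_DELAY:-5}"
        i=$((i + 1))
    done
    return 1
}

# check_ilo HOST: query chassis power state over IPMI lanplus.
check_ilo() {
    ipmitool -I lanplus -H "$1" -U "$ILO_USER" -f "$ILO_PASS_FILE" \
        -L OPERATOR chassis power status
}

# With ILO_HOSTS unset or empty, the loop body never runs.
for host in $ILO_HOSTS; do
    retry 3 check_ilo "$host" || echo "iLO check failed for $host" >&2
done
```

Retrying inside the job means a single slow iLO response is not reported as a failure, which is the same tolerance I would like from the cluster's own monitor.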
From internal R&D testing, it appears that if the fence agent is "failed" or
"stopped", and the cluster actually needs to fence a node, then the cluster
will re-attempt the fence agent start and fence the node.
Here is the config:
Stonith Devices:
 Resource: fence_tval13 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=X.X.X.X lanplus=1 login=XXXXX method=onoff passwd=XXXXX
              pcmk_host_list=tval13 power_wait=20 privlvl=OPERATOR
  Operations: monitor interval=15m (fence_tval13-monitor-interval-15m)
              start interval=0s timeout=125s (fence_tval13-start-interval-0s)
 Resource: fence_tval14 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=Y.Y.Y.Y lanplus=1 login=YYYYY method=onoff passwd=YYYYY
              pcmk_host_list=tval14 power_wait=20 privlvl=OPERATOR
  Operations: monitor interval=15m (fence_tval14-monitor-interval-15m)
              start interval=0s timeout=125s (fence_tval14-start-interval-0s)
Thanks
Robert