On 20/08/18 10:51 +0200, Ulrich Windl wrote:
> I wonder whether it's possible to run a monitoring op only if some
> specific resource is up.
> Background: We have some resource that runs fine without NFS, but
> the start, stop and monitor operations will just hang if NFS is
> down. In effect the monitor operation will time out, the cluster
> will try to recover, calling the stop operation, which in turn will
> time out, making things worse (i.e.: causing a node fence).
>
> So my idea was to pause the monitoring operation while NFS is down
> (NFS itself is controlled by the cluster and should recover "rather
> soon" TM).
>
> Is that possible?
> And before you ask: No, I have not written that RA that has the
> problem; a multi-million-dollar company wrote it (Years before I had
> written a monitor for HP-UX' cluster that did not have this problem,
> even though the configuration files were read from NFS (It's not
> magic: Just periodically copy them to shared memory, and read the
> config from shared memory)).

Sorry for stating the likely obvious; in a similar spirit, if the
agent at hand allows configuring the config location, you can
synchronize the shared copy into offline node-local mirrors, e.g.
using csync2. The problem then boils down to whether the "cluster
approved, synchronized and fresh" version is what gets used.

It doesn't look like there's any silver bullet. Any attempt to
overcome "holistic integrity" (which is the native approach with
pacemaker; anything else is swimming against the stream) may bite
you or affect HA at some possibly unanticipated point.

If you don't want to or cannot tamper with the resource agents
(wrapping call-outs, etc.), your best bet is to ask the respective
author/vendor to honour OCF_CHECK_LEVEL[1] in the "monitor" action
properly, meaning that no file-based traversal (possibly getting
stuck on NFS access) would be attempted by default (level "0"),
though it could be at level "10" or higher. Then do not set it
artificially to higher levels in your configuration (or
conditionalize it similarly to what Ken suggested).

Evidently, this won't fix the "stop" issues, for instance.

[1] https://github.com/ClusterLabs/OCF-spec/blob/42697cc9fd716173c7da6fa67148dd579282da96/ra/1.0/resource-agent-api.md#parameters-specific-to-the-monitor-action

-- 
Nazdar,
Jan (Poki)
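
To illustrate what honouring OCF_CHECK_LEVEL in "monitor" could look
like, here is a minimal sketch in Python. It is not the vendor's agent
and not a complete resource agent: the function names, the pidfile
liveness check and the 5 second timeout are all assumptions made for
illustration, and it assumes the level arrives in the environment as
OCF_CHECK_LEVEL, as the spec describes. The point is only that level 0
never touches the (possibly NFS-backed) config path, while level 10 or
higher does so in a child process with a timeout.

```python
import os
import subprocess

# Exit codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def check_level(env=None):
    """Return the requested check level, defaulting to 0."""
    env = os.environ if env is None else env
    try:
        return int(env.get("OCF_CHECK_LEVEL", "0"))
    except ValueError:
        return 0


def process_alive(pidfile):
    """Cheap liveness test; assumes the pidfile lives on local disk."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
        os.kill(pid, 0)  # signal 0 only checks that the process exists
        return True
    except PermissionError:
        return True  # process exists but belongs to another user
    except (OSError, ValueError):
        return False


def monitor(pidfile, config_path, env=None, timeout=5):
    """Hypothetical monitor action: NFS is touched only at level >= 10."""
    if not process_alive(pidfile):
        return OCF_NOT_RUNNING
    if check_level(env) < 10:
        return OCF_SUCCESS  # default level: no file-based checks at all
    try:
        # Run the file access in a child so a hang can be timed out.
        # Assumption: the stuck child can still be killed on this mount.
        result = subprocess.run(["test", "-r", config_path], timeout=timeout)
    except subprocess.TimeoutExpired:
        return OCF_ERR_GENERIC
    return OCF_SUCCESS if result.returncode == 0 else OCF_ERR_GENERIC
```

With something along these lines, a recurring monitor left at the
default level cannot hang on NFS; only an op explicitly configured
with a higher OCF_CHECK_LEVEL would. As said above, "stop" would still
need separate treatment.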
_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
