>>> Jan Pokorný <[email protected]> wrote on 04.12.2019 at 21:19 in message
<[email protected]>:
> On 04/12/19 14:53 +0900, Ondrej wrote:
>> When adding 'LSB' script to pacemaker cluster I can see that
>> pacemaker advertises 'restart' and 'force-reload' operations to be
>> present - regardless if the LSB script supports it or not. This
>> seems to be coming from following piece of code.
>>
>> https://github.com/ClusterLabs/pacemaker/blob/92b0c1d69ab1feb0b89e141b5007f8792e69655e/lib/services/services_lsb.c#L39-L40
>>
>> Questions:
>> 1. When the 'restart' and 'force-reload' operations are called on
>> the LSB script cluster resource?
>
> [reordered]
>
>> I would have expected that 'restart' operation would be called when
>> using 'crm_resource --restart --resource myResource', but I can see
>> that 'stop' and 'start' operations are used in that case instead.
>
> This is due to how "crm_resource --restart" is arranged,
> directly in the implementation of this CLI tool itself
> (see tools/crm_resource_runtime.c:cli_resource_restart):
>
> - first, target-role meta-attribute for resource is set to Stopped
>
> - then, once the activity settled, it is set back to the target-role
>   it was originally at

...and if you are unlucky, the node processing the "restart" is fenced
because of a failed stop, and the resource is not started again until
someone intervenes manually. This is clearly a design deficit.

>
> Performing this stepwise like this, there's no reasonably
> implementable mapping back to a single step being the actual
> composition (stop, start -> restart) when the plan is not shared
> in full in advance (it is not) with the respective moving parts.
> And there's plain common sense that would still preclude it (below).
>
> Hence, it is in actuality a great discovery that "restart" triggering
> verb/action is in fact completely neglected and bogus when it comes
> to handling by pacemaker. If it implements any optimizations (thanks
> to having the intimate knowledge of the resource at hand, plus knowing
> before-after state combo and possibly how to transition in one go),
> cluster resource management won't benefit from that in any way.

Time for a change, it seems.
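To make the two-step arrangement quoted above concrete, here is a rough
sketch of doing the same thing by hand. It is not the actual code of
cli_resource_restart: the real tool sets target-role back to whatever it
was originally, while this simplification just forces "Started". The
function name restart_stepwise and the CRM_RESOURCE override are made up
for illustration, so that the sequence can be dry-run with echo.

```shell
#!/bin/sh
# Hypothetical helper: approximates "crm_resource --restart" as two
# separate target-role changes. CRM_RESOURCE is an illustrative
# override (not a pacemaker feature) to allow a dry run.
CRM_RESOURCE="${CRM_RESOURCE:-crm_resource}"

restart_stepwise() {
    rsc="$1"

    # Step 1: ask the cluster to stop the resource.
    "$CRM_RESOURCE" --resource "$rsc" --meta \
        --set-parameter target-role --parameter-value Stopped || return 1

    # Let the cluster settle before continuing.
    "$CRM_RESOURCE" --wait || return 1

    # Step 2: ask the cluster to start it again.
    # Simplification: the original target-role is not remembered here.
    "$CRM_RESOURCE" --resource "$rsc" --meta \
        --set-parameter target-role --parameter-value Started || return 1

    "$CRM_RESOURCE" --wait
}
```

With CRM_RESOURCE=echo, calling "restart_stepwise myResource" only prints
the four commands. It also shows the weak spot: if whatever runs this
sequence dies between step 1 and step 2, target-role stays at Stopped.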
>
> Interestingly, such optimizations are exactly what the original
> OCF draft had in mind :-)
> https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
> (even more interestingly, only to be reconsidered again some decades
> later: https://github.com/ClusterLabs/OCF-spec/issues/10;
> yeah, aren't we masters of following targets moving to the extent they
> are sometimes contradictory? I'd blame a desperate lack of written
> [and easily obtainable] design decisions made in the past for that)
>
> They are mandated by LSB as well, but hey, in systemd era, we are
> now _free_ to call LSB severely broken as it (shamefully, I'd say)
> never even tried to accommodate proper dealing with dependency
> chains (and actual serializability thereof!), as explained
> in an example below. Or put in other words, LSB was never meant
> to stand for a holistic resource management, something both systemd
> and pacemaker attempt to cover (single/multi-machine wide).

What I strongly dislike about systemd is that it feels it has to
interfere with manual commands not related to systemd. For example, when
I mount a remote NFS filesystem locally as root, systemd interferes by
doing some additional actions. At home I had an external USB disk with
read errors, and systemd interfered when the kernel reported read
errors from the device. OK, that was off-topic...
>
> OTOH, this enforced split of state transitions is perhaps what makes
> the transaction (comprising perhaps countless other interdependent
> resources) serializable and thus feasible at all (think: you cannot
> nest any further handling -- so as to satisfy given constraints -- in
> between stop and start when that's an atom, otherwise), and that's
> exactly how, say, systemd approaches that, likely for that very reason:
> https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
>
> So I see a room for improvement here as our takeaway:
>
> * resource agents:
>
>   - some agents declare/implement "restart" action when there is
>     no practical reason to (AudibleAlarm, Xinetd, dhcpd, etc.)
>     [as a side note, there are nonsensical considerations, such as
>     when default "start" and "stop" timeouts for dhcpd are 20 seconds
>     each, how come, then, that "restart" defined as "stop; start"
>     would also make do with 20 seconds altogether, unless there is
>     some amortized work I fail to see :-)]

All these suggested default timeouts have to be read with some sense of
humor, I'm afraid. Unfortunately pacemaker has no sense of humor yet,
AFAIK ;-)

>
> * pacemaker:
>
>   - artificially generated meta-data mention "restart" action when
>     there is no good reason to (lib/services/services_lsb.c)
>
>   - there are some correct clues in Pacemaker Explained, but perhaps,
>     it shall take a time to emphasize that whenever "restart" is
>     referred, it is never an atomic step, but always a sequence
>     of two steps that may be considered atomic on their own,
>     but possibly interleaved with other steps so as to retain
>     soundness wrt. the imposed constraints and/or changes made
>     in parallel

I think pacemaker should know of an "agenda" in addition to transitions.
That agenda should be persistent in the CIB, so that it survives
fencing. In case of a restart it would ensure that the resource is
started at the end (best effort at least).
>
>   - the same gist of "restart" shall be sketched in a help screen
>     of crm_resource
>
>> For 'force-reload' I have no idea on how to try trigger it looking
>> at 'crm_resource --help' output.
>
> Sorry, that's even more bogus, as there's no relevance whatsoever.
> It needs to either be dropped from artificially generated meta-data
> as well, or investigated further whether there's any reason to make
> of such an operation triggerable by users, and if positive, how
> much of impact spread to be expected when implemented (do the
> dependent services need to be reloaded or "restarted" as well,
> since the change might be non-local? any precedent there?
> again, hard to analyse in the lack of written design decisions
> that would provide an immediate frame for thinking about this)

Actually, in the past I used an RA that was just a wrapper around a
control script that allowed a "reload" (re-reading the configuration
file without creating a new process). Such a reload was completely
independent of any RA parameter change. (I would have needed to add a
dummy parameter like "configurationfile_change_time" just to signal
some change to the RA/pacemaker.) So a command for a "reload through
the RA" might make sense.

>
> [reordered]
>
>> 2. How can I trigger 'restart' and 'force-reload' operation on LSB
>> script cluster resource in pacemaker?
>>
>> Cluster resource definition looks like this:
>> <primitive class="lsb" id="myResource" type="script.sh">
>>   <operations>
>>     <op id="myResource-force-reload-interval-0s" interval="0s"
>>         name="force-reload" timeout="15s"/>
>>     <op id="myResource-monitor-interval-15" interval="15" name="monitor"
>>         timeout="15"/>
>>     <op id="myResource-restart-interval-0s" interval="0s" name="restart"
>>         timeout="15"/>
>>     <op id="myResource-start-interval-0s" interval="0s" name="start"
>>         timeout="15"/>
>>     <op id="myResource-stop-interval-0s" interval="0s" name="stop"
>>         timeout="15"/>
>>   </operations>
>>   <instance_attributes id="myResource-instance_attributes"/>
>>   <meta_attributes id="myResource-meta_attributes"/>
>> </primitive>
>>
>> [...]
>>
>> I want to make sure that cluster will not attempt running 'restart'
>> nor 'force-reload' on script that is not implementing them.
>
> Understood, I am reasonably sure about the former and definitely sure
> about the latter, in the current state of implementation anyway.
> That you even need to stress about these bogus circumstances doesn't
> put us in a good light, but the more important this feedback loop is.
>
>> As for now I'm considering to return exit code '3' from script when
>> these actions are called to indicate that they are 'unimplemented
>> feature' as suggested by LSB specification below. However I would
>> like to verify that this works as expected.
>>
>> http://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
>
> If your resource is solely to be run under pacemaker, I'd prune
> all those quirks altogether, to make one's life easier.
>
> --
> Jan (Poki)
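For the exit-code approach Ondrej describes, a minimal sketch of an
LSB-style action dispatcher could look like the following. The names
lsb_dispatch and do_start/do_stop/do_status are hypothetical
placeholders, not taken from any existing script. The exit codes 2
(invalid or excess argument) and 3 (unimplemented feature) are the LSB
values for non-status actions. Going by Jan's answer above, pacemaker is
unlikely to call 'restart' or 'force-reload' at all, so this is only a
safety net.

```shell
#!/bin/sh
# Hypothetical stubs; a real script would manage the actual service here.
do_start()  { return 0; }
do_stop()   { return 0; }
do_status() { return 0; }   # LSB: 0 = running, 3 = not running

# Dispatch one LSB action and return an LSB-style exit code.
lsb_dispatch() {
    case "$1" in
        start)  do_start ;;
        stop)   do_stop ;;
        status) do_status ;;
        restart|try-restart|reload|force-reload)
            # LSB exit code 3 for non-status actions: unimplemented feature
            echo "$1: not implemented" >&2
            return 3
            ;;
        *)
            # LSB exit code 2: invalid or excess argument(s)
            echo "Usage: $0 {start|stop|status}" >&2
            return 2
            ;;
    esac
}
```

A real init script would end with something like
'lsb_dispatch "$1"; exit $?'. Note that 3 means something different for
'status' (program not running) than for the other actions.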
