On Tue, 2020-09-15 at 13:25 +0200, Lars Ellenberg wrote:
> On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> > On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > > crmd on the DC was doing an error exit and was respawned:
> > > >
> > > > cib: info: cib_process_ping: Reporting our current digest
> > > > crmd: error: do_pe_invoke_callback: Could not retrieve the Cluster Information Base: Timer expired
> > > > ...
> > > > pacemakerd: error: pcmk_child_exit: The crmd process (17178) exited: Generic Pacemaker error (201)
> > > > pacemakerd: notice: pcmk_process_exit: Respawning failed child process: crmd
> > > >
> > > > The new DC now causes:
> > > > cib: info: cib_perform_op: Diff: --- 0.971.201 2
> > > > cib: info: cib_perform_op: Diff: +++ 0.971.202 (null)
> > > > cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']
> > > >
> > > > But the attrd apparently does not notice that transient attributes it
> > > > had cached are now gone.
> > >
> > > This is a known issue. There was some work done on it in stages that
> > > never went anywhere:
> > >
> > > https://github.com/ClusterLabs/pacemaker/pull/1695
> > >
> > > https://github.com/ClusterLabs/pacemaker/pull/1699
> > >
> > > https://github.com/ClusterLabs/pacemaker/pull/2020
> > >
> > > The basic idea is that the controller should ask pacemaker-attrd to
> > > clear a node's transient attributes rather than doing so directly, so
> > > attrd and the CIB stay in sync. Backward compatibility would be
> > > tricky.
> > >
> > > The fix would only be in Pacemaker 2, since this would require a
> > > feature set bump, which can't be backported.
> >
> > Thank you for that quick response and all the context above.
> >
> > You mention below
> >
> > > the controller
> > > should request node attribute erasure only if the node leaves the
> > > corosync membership, not just the controller CPG.
> >
> > Would that be a change that could go into the 1.1.x series?
>
> Suggestion to mitigate the issue:
>
> periodically, for example from a monitor action of a simple resource
> agent script, do:
>
> if attrd_updater -n attrd-canary --update 1; then
>     crm_attribute --lifetime reboot --name attrd-canary --query ||
>         attrd_updater --refresh
> fi
>
> Do you see any possible issues with that approach?
>
> Lars

That should work.
--
Ken Gaillot <[email protected]>
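
For reference, the canary check quoted above could be wrapped in a small shell function and called from a resource agent's monitor action. This is only a sketch under assumptions: the function name attrd_canary_check and the CANARY variable are made up for illustration, and returning success when attrd cannot be reached is a design choice of this sketch, not something stated in the thread. The commands and options are the ones from the quoted suggestion.

```shell
#!/bin/sh
# Illustrative sketch only; attrd_canary_check and CANARY are hypothetical names.
CANARY="attrd-canary"

attrd_canary_check() {
    # Write the canary through attrd. If attrd cannot be reached, skip the
    # check instead of failing (assumption of this sketch).
    attrd_updater -n "$CANARY" --update 1 || return 0

    # Look for the transient (reboot-lifetime) attribute in the CIB. If the
    # query fails, the CIB copy is presumably gone while attrd still has it cached.
    if ! crm_attribute --lifetime reboot --name "$CANARY" --query >/dev/null 2>&1; then
        # Ask attrd to write its cached values back to the CIB.
        attrd_updater --refresh
    fi
    return 0
}
```

Note that the function always returns 0, so a missing canary triggers a refresh but never reports the calling resource as failed.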
