On Mon, 2021-07-26 at 12:25 -0400, Digimer wrote:
> On 2021-07-26 9:54 a.m., [email protected] wrote:
> > On Fri, 2021-07-23 at 21:46 -0400, Digimer wrote:
> > > After a LOT of hassle, I finally got it updated, but OMG it was
> > > painful.
> > > 
> > > I degraded the cluster (unsure if needed), set maintenance mode,
> > > deleted the stonith levels, deleted the stonith devices, recreated
> > > them with the updated values, recreated the stonith levels, and
> > > finally disabled maintenance mode.
> > > 
> > > It should not have been this hard, right? Why the heck would it be
> > > that pacemaker kept "rolling back" to old configs? I'd delete the
> > > stonith
> > 
> > That is bizarre. It sounds like the CIB changes were taking effect
> > locally, then being rejected by the rest of the cluster, which would
> > send the "correct" CIB back to the originator.
> > 
> > The logs of interest would be pacemaker.log from both nodes at the
> > time you made the first configuration change that failed. I'm
> > guessing the logs you posted were from after that point?
> 
> Below are the logs. The first attempted change appears at 'Jul 23
> 16:22:27', made on an-a02n01. I included logs from a few minutes
> before in case they are relevant.
> * an-a02n01: 
> https://www.alteeve.com/an-repo/files/an-a02n01.pacemaker.log
> * an-a02n02: 
> https://www.alteeve.com/an-repo/files/an-a02n02.pacemaker.log
> Note that the PDUs as originally configured (10.201.2.1/2) were not
> available, so I had to disable and cleanup the stonith resources.
> They seemed to keep getting re-enabled, so I got into the habit of
> doing this cycle of disable -> cleanup -> disable -> cleanup before I
> could reliably get the resources to be 'stopped (disabled)' in 'pcs
> stonith status'.
> digimer

The initial change happened here:

Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: Diff: --- 0.337.112 2
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: Diff: +++ 0.338.0 6a24af66df3d9f825cc2681222f8f5d6
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: +  /cib:  @epoch=338, @num_updates=0
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: +  /cib/configuration/resources/primitive[@id='apc_snmp_node1_an-pdu03']/instance_attributes[@id='apc_snmp_node1_an-pdu03-instance_attributes']/nvpair[@id='apc_snmp_node1_an-pdu03-instance_attributes-ip']:  @value=10.201.2.3
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_replace_notify)         info: Replaced: 0.337.112 -> 0.338.0 from an-a02n02
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_process_request)        info: Completed cib_replace operation for section configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)

origin=an-a02n02/cibadmin/2 means that someone or something ran the
cibadmin tool on an-a02n02. Presumably this was your interactive pcs
command.
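
To see at a glance which node and client each completed CIB request came
from, something like the following could be run over pacemaker.log. It
is only a rough sketch: the names cib_requests and CibRequest are made
up for illustration, the regular expression is modelled on the
"Completed ... operation" lines quoted in this message, and it expects
each log entry on a single line, as in the log file itself.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class CibRequest(NamedTuple):
    """One completed CIB request (hypothetical helper type)."""
    timestamp: str
    operation: str
    node: str      # first part of origin=, the node the request came from
    client: str    # second part of origin=, e.g. cibadmin
    version: str   # CIB version after the request


# Modelled only on the "Completed ... operation" lines shown in this
# thread; other Pacemaker versions may format these lines differently.
_PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s.*"
    r"Completed (?P<op>\w+) operation for section .*"
    r"origin=(?P<node>[^/,]+)/(?P<client>[^/,]+)/[^,]*, "
    r"version=(?P<version>[\d.]+)\)"
)


def cib_requests(lines: Iterable[str]) -> Iterator[CibRequest]:
    """Yield one CibRequest per matching log line, skipping all others."""
    for line in lines:
        m = _PATTERN.search(line)
        if m:
            yield CibRequest(m["ts"], m["op"], m["node"],
                             m["client"], m["version"])


if __name__ == "__main__":
    import sys
    # Usage: python3 cib_origins.py < pacemaker.log
    for req in cib_requests(sys.stdin):
        print(req.timestamp, req.version, req.operation,
              f"{req.client}@{req.node}")
```

Filtering the output for client "cibadmin" around the time of interest
narrows down which node the changes were submitted from.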

It was then reverted by:

Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: Diff: --- 0.343.3 2
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: Diff: +++ 0.344.0 (null)
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: +  /cib:  @epoch=344, @num_updates=0
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++ /cib/configuration/resources:  <primitive class="stonith" id="apc_snmp_node1_an-pdu03" type="fence_apc_snmp"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                  <instance_attributes id="apc_snmp_node1_an-pdu03-instance_attributes">
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                    <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-ip" name="ip" value="10.201.2.1"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                    <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-a02n01"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                    <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_off_action" name="pcmk_off_action" value="reboot"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                    <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-port" name="port" value="5"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                  </instance_attributes>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                  <operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                    <op id="apc_snmp_node1_an-pdu03-monitor-interval-60" interval="60" name="monitor"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                  </operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_perform_op)     info: ++                                </primitive>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based     [121628] (cib_process_request)        info: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.344.0)

Notice the origin is still cibadmin on an-a02n02. So this was either
you, or a script or cron on that node. I don't see any additional
details on that node.
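
Note that the version strings in the diffs are
admin_epoch.epoch.num_updates, and the "revert" moved the epoch forward
(0.343.3 -> 0.344.0) rather than back. In other words, the old values
arrived as a brand-new configuration change, not as the cluster
restoring an earlier CIB. A minimal sketch of that comparison, where
parse_cib_version is a made-up helper and not part of any Pacemaker
tool:

```python
from typing import Tuple


def parse_cib_version(text: str) -> Tuple[int, int, int]:
    """Split 'admin_epoch.epoch.num_updates' into a comparable tuple.

    Hypothetical helper for reading the log excerpts above; raises
    ValueError if the string does not have exactly three numeric parts.
    """
    admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
    return admin_epoch, epoch, num_updates


first_change = parse_cib_version("0.338.0")   # IP set to 10.201.2.3
revert = parse_cib_version("0.344.0")         # old primitive re-added

# Tuples compare field by field, in the same order of significance as
# admin_epoch, then epoch, then num_updates.
print(revert > first_change)  # True: the "revert" is the newer version
```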
-- 
Ken Gaillot <[email protected]>
