On Mon, 2021-07-26 at 12:25 -0400, Digimer wrote:
> On 2021-07-26 9:54 a.m., [email protected] wrote:
> > On Fri, 2021-07-23 at 21:46 -0400, Digimer wrote:
> > > After a LOT of hassle, I finally got it updated, but OMG it was
> > > painful.
> > >
> > > I degraded the cluster (unsure if needed), set maintenance mode,
> > > deleted the stonith levels, deleted the stonith devices, recreated
> > > them with the updated values, recreated the stonith levels, and
> > > finally disabled maintenance mode.
> > >
> > > It should not have been this hard, right? Why in heck would it be
> > > that pacemaker kept "rolling back" to old configs? I'd delete the
> > > stonith
> >
> > That is bizarre. It sounds like the CIB changes were taking effect
> > locally, then being rejected by the rest of the cluster, which would
> > send the "correct" CIB back to the originator.
> >
> > The logs of interest would be pacemaker.log from both nodes at the
> > time you made the first configuration change that failed. I'm
> > guessing the logs you posted were from after that point?
>
> Below are the logs. The change appears to first try at 'Jul 23
> 16:22:27', made on an-a02n01, included logs for a few minutes before
> in case relevant.
>
> * an-a02n01:
>   https://www.alteeve.com/an-repo/files/an-a02n01.pacemaker.log
> * an-a02n02:
>   https://www.alteeve.com/an-repo/files/an-a02n02.pacemaker.log
>
> Note that the PDUs as originally configured (10.201.2.1/2) were not
> available, so I had to disable and cleanup the stonith resources.
> They seemed to keep getting re-enabled, so I got into the habit of
> doing this cycle of disable -> cleanup -> disable -> cleanup before I
> could reliably get the resources to be 'stopped (disabled)' in 'pcs
> stonith status'.
>
> digimer

The initial change happened here:

Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: --- 0.337.112 2
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: +++ 0.338.0 6a24af66df3d9f825cc2681222f8f5d6
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib: @epoch=338, @num_updates=0
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='apc_snmp_node1_an-pdu03']/instance_attributes[@id='apc_snmp_node1_an-pdu03-instance_attributes']/nvpair[@id='apc_snmp_node1_an-pdu03-instance_attributes-ip']: @value=10.201.2.3
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_replace_notify) info: Replaced: 0.337.112 -> 0.338.0 from an-a02n02
Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] (cib_process_request) info: Completed cib_replace operation for section configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)

origin=an-a02n02/cibadmin/2 means that someone or something ran the
cibadmin tool on an-a02n02. Presumably this was your interactive pcs
command.

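
As an aside, if you want to see every writer of the CIB at a glance
rather than reading the diffs by eye, something like the following
might help. It is only a rough Python sketch, not a Pacemaker tool: the
regular expression is modelled on the two "Completed ... operation"
lines quoted in this thread and may well need adjusting for other
message variants, and the names COMPLETED and cib_writers are made up
for illustration. It pulls out the timestamp, the operation, the
originating node and client, and the resulting CIB version.

```python
import re

# Pattern modelled on the "Completed ... operation" lines quoted in this
# thread (an assumption; other pacemaker-based messages may differ).
COMPLETED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+pacemaker-based\b.*?"
    r"Completed (?P<op>\S+) operation for section '?(?P<section>[^':]+)'?: "
    r".*?origin=(?P<origin>[^,)]+), version=(?P<version>[\d.]+)\)"
)


def cib_writers(lines):
    """Yield (timestamp, operation, node, client, version) per matching line."""
    for line in lines:
        m = COMPLETED.search(line)
        if not m:
            continue
        # origin looks like "an-a02n02/cibadmin/2": node first, then client.
        parts = m.group("origin").split("/")
        node = parts[0]
        client = parts[1] if len(parts) > 1 else ""
        # Version as a tuple of ints so that versions compare numerically.
        version = tuple(int(p) for p in m.group("version").split("."))
        yield m.group("stamp"), m.group("op"), node, client, version


# The two summary lines from the excerpts in this thread.
sample = [
    "Jul 23 16:22:27 an-a02n01.alteeve.com pacemaker-based [121628] "
    "(cib_process_request) info: Completed cib_replace operation for section "
    "configuration: OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.338.0)",
    "Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] "
    "(cib_process_request) info: Completed cib_apply_diff operation for section "
    "'all': OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.344.0)",
]

results = list(cib_writers(sample))
for stamp, op, node, client, version in results:
    print(stamp, op, node, client, ".".join(str(p) for p in version))
```

To run it against a real log, pass an open file instead of the sample
list, e.g. cib_writers(open("pacemaker.log")). Any origin that is not
one of your own interactive commands is worth a closer look.
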
It was then reverted by:

Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: --- 0.343.3 2
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: Diff: +++ 0.344.0 (null)
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: + /cib: @epoch=344, @num_updates=0
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ /cib/configuration/resources: <primitive class="stonith" id="apc_snmp_node1_an-pdu03" type="fence_apc_snmp"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <instance_attributes id="apc_snmp_node1_an-pdu03-instance_attributes">
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-ip" name="ip" value="10.201.2.1"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-a02n01"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-pcmk_off_action" name="pcmk_off_action" value="reboot"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <nvpair id="apc_snmp_node1_an-pdu03-instance_attributes-port" name="port" value="5"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </instance_attributes>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ <op id="apc_snmp_node1_an-pdu03-monitor-interval-60" interval="60" name="monitor"/>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </operations>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_perform_op) info: ++ </primitive>
Jul 23 16:22:50 an-a02n01.alteeve.com pacemaker-based [121628] (cib_process_request) info: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=an-a02n02/cibadmin/2, version=0.344.0)

Notice the origin is still cibadmin on an-a02n02. So this was either
you, or a script or cron on that node. I don't see any additional
details on that node.
-- 
Ken Gaillot <[email protected]>
