On 2020-10-07 2:35 a.m., Ulrich Windl wrote:
>>>> Digimer <[email protected]> wrote on 07.10.2020 at 05:42 in message
> <[email protected]>:
>> Hi all,
>>
>>   While developing our program (and not being a production cluster), I
>> find that when I push broken code to a node, causing the RA to fail to
>> perform an operation, the node gets fenced. (example below).
>
> (I see others have replied, too, but anyway)
> Specifically it's the "stop" operation that may not fail.
>
>>
>>   This brings up a question;
>>
>>   If a single resource fails for any reason and can't be recovered, but
>> other resources on the node are still operational, how can I suppress a
>> self-fence? I'd rather one failed resource than having all resources get
>> killed (they're VMs, so restarting on the peer is ... disruptive).
>
> I think you can (on-fail=block, AFAIR).
> Note: This is not a political statement for any near elections ;-)

Indeed, and this works. I misunderstood the pcs syntax and applied the
'on-fail="stop"' to the monitor operation... Whoops.

>>   If this is a bad approach (sufficiently bad to justify hard-rebooting
>> other VMs that had been running on the same node), why is that? Are
>> there any less-bad options for this scenario?
>>
>>   Obviously, I would never push untested code to a production system,
>> but knowing now that this is possible (losing a node with its other VMs
>> on an RA / code fault), I'm worried about some unintended "oops" causing
>> the loss of a node.
>>
>>   For example, would it be possible to have the node try to live migrate
>> services to the other peer, before self-fencing in a scenario like this?
>
> As there is no guarantee that migration will succeed without fencing the
> node, it could only be done with a timeout; otherwise the node will be
> hanging while waiting for migration to succeed.

I figured as much.

>>   Are there other options / considerations I might be missing here?
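
For the archives, a minimal sketch of where the setting goes: on-fail on
the "stop" operation rather than on "monitor". It reuses the stop <op>
from the example config quoted below and only adds the on-fail attribute;
treat it as an illustration rather than a tested configuration.

====
<!-- Sketch only: with on-fail="block", a failed stop leaves the resource
     blocked on the node for manual cleanup instead of escalating to a
     fence. -->
<op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
    timeout="INFINITY" on-fail="block"/>
====

The trade-off is that a blocked resource stays where it is until someone
cleans it up, so it cannot be recovered on the peer in the meantime.
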
>>
>>   example VM config:
>>
>> ====
>> <primitive class="ocf" id="srv07-el6" provider="alteeve" type="server">
>>   <instance_attributes id="srv07-el6-instance_attributes">
>>     <nvpair id="srv07-el6-instance_attributes-name" name="name"
>>             value="srv07-el6"/>
>>   </instance_attributes>
>>   <meta_attributes id="srv07-el6-meta_attributes">
>>     <nvpair id="srv07-el6-meta_attributes-allow-migrate"
>>             name="allow-migrate" value="true"/>
>>     <nvpair id="srv07-el6-meta_attributes-migrate_to"
>>             name="migrate_to" value="INFINITY"/>
>>     <nvpair id="srv07-el6-meta_attributes-stop" name="stop"
>>             value="INFINITY"/>
>>     <nvpair id="srv07-el6-meta_attributes-target-role"
>>             name="target-role" value="Stopped"/>
>>   </meta_attributes>
>>   <operations>
>>     <op id="srv07-el6-migrate_from-interval-0s" interval="0s"
>>         name="migrate_from" timeout="600"/>
>>     <op id="srv07-el6-migrate_to-interval-0s" interval="0s"
>>         name="migrate_to" timeout="INFINITY"/>
>>     <op id="srv07-el6-monitor-interval-60" interval="60"
>>         name="monitor" on-fail="block"/>
>>     <op id="srv07-el6-notify-interval-0s" interval="0s"
>>         name="notify" timeout="20"/>
>>     <op id="srv07-el6-start-interval-0s" interval="0s"
>>         name="start" timeout="30"/>
>>     <op id="srv07-el6-stop-interval-0s" interval="0s" name="stop"
>>         timeout="INFINITY"/>
>>   </operations>
>> </primitive>
>> ====
>>
>>   Logs from a code oops in the RA triggering a node self-fence;
>>
>> ====
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice:
>> srv07-el6_stop_0:36779:stderr [ DBD::Pg::db do failed: ERROR: syntax
>> error at or near "3" ]
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice:
>> srv07-el6_stop_0:36779:stderr [ LINE 1: ...ut off, server_boot_time = 0
>> WHERE server_uuid = '3d73db4c-d... ]
>> Oct 06 23:33:54 mk-a02n01.digimer.ca pacemaker-execd[33816]: notice:
>> srv07-el6_stop_0:36779:stderr [
>> ^ at /usr/share/perl5/Anvil/Tools/Database.pm line
>> 13791.
>> ]
>
> As I'm writing a lot of Perl code, too: Do you know "perl -c" to check the
> syntax, BTW?
>
> And don't forget ocf-tester. ;-)

I did not know about ocf-tester, thanks for the hint. As for 'perl -c',
the issue above was caused by a bad SQL statement, don't think perl can
catch that. :)

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
