In clouds you can't just use VIPs. Use the azure-lb resource instead.

Best Regards,
Strahil Nikolov

On Fri, Jul 29, 2022 at 23:21, Reid Wahl <[email protected]> wrote:

On Fri, Jul 29, 2022 at 1:02 PM Reid Wahl <[email protected]> wrote:
>
> On Fri, Jul 29, 2022 at 12:52 PM Ross Sponholtz <[email protected]> wrote:
> >
> > I'm running a RHEL Pacemaker cluster on Azure, and I've gotten a failure
> > and fencing where I get these messages in the log file:
> >
> > warning: vip_ABC_30_monitor_10000 process (PID 1779737) timed out
> > crit: vip_ABC_30_monitor_10000 process (PID 1779737) will not die!
> >
> > This resource uses the IPaddr2 resource agent. I've looked at the agent
> > code, and I can't pinpoint any reason it would hang, and since the node
> > gets fenced, I can't tell why this happens -- any ideas on what kinds of
> > failures could cause this problem?
> >
> > Thanks,
> > Ross
>
> Are you able to reproduce this? I suggest adding `trace_ra=1` to the
> resource configuration in order to determine where it's hanging.
>
> # pcs resource update vip_ABC trace_ra=1
>
> This will produce a shell trace of each operation in
> /var/lib/heartbeat/trace_ra/IPaddr2. This is naturally quite a lot of
> logging, so remove the option when you've gotten what you need.
>
> # pcs resource update vip_ABC trace_ra=
>
> Also discussed in this article (you should have access if you're on RHEL):
> - How can I determine exactly what is happening with every operation
>   on a resource in Pacemaker?
>   (https://access.redhat.com/solutions/3182931)
You may also want to set on-fail=block for the stop operation to prevent
the node from getting fenced while you troubleshoot this.

# pcs resource update vip_ABC op stop interval=0s timeout=<whatever_the_current_timeout_is> on-fail=block

Other than that, trace_ra=1 will generally tell us quite a lot -- I just
hope that it _does_ get written, given that the child process becomes
unkillable.

The IPaddr2 resource agent doesn't do all that much. It runs a few `ip`
commands and sends an ARP refresh. That's about it. I generally wouldn't
expect any of those to hang unless there's a deeper issue.

--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
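For illustration, the start operation Reid describes boils down to roughly
the following; the address, prefix, and interface here are hypothetical
values, not taken from this thread:

# ip addr add 10.0.0.10/24 dev eth0
# arping -U -c 5 -I eth0 10.0.0.10

The first plumbs the VIP as a secondary address; the second sends a
gratuitous ARP so that neighbors update their caches (the agent itself uses
a helper such as send_arp for this step). The monitor operation essentially
just checks with `ip addr show` that the address is still present.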
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
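On Strahil's azure-lb suggestion at the top of the thread, a minimal
sketch, assuming the Azure load balancer's health probe is configured for
port 61000 and using hypothetical resource and group names:

# pcs resource create health_probe_ABC azure-lb port=61000
# pcs resource group add g_vip_ABC health_probe_ABC vip_ABC

The azure-lb agent just listens on the probe port on whichever node holds
the resource, so the load balancer's frontend IP follows the group; in the
common Azure pattern it is grouped with the IPaddr2 VIP rather than
replacing it outright.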
