On Tue, 2022-06-14 at 15:53 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <[email protected]> wrote on 14.06.2022 at
> > > > 15:49 in message
> <[email protected]>:
> > On Tue, 2022-06-14 at 14:36 +0200, Ulrich Windl wrote:
> > > Hi!
> > >
> > > I had a case where a VirtualDomain monitor operation ended in a
> > > core dump (actually it was pacemaker-execd, but it counted as a
> > > "monitor" operation), and the cluster decided to restart the VM.
> > > Wouldn't it be worth retrying the monitor operation first?
> >
> > It counts like any other monitor failure.
> >
> > > Chances are that a retried monitor operation returns a better
> > > status than a segmentation fault.
> > > Or does the logic just ignore processes dying on signals?
> > >
> > > 20201202.ba59be712-150300.4.21.1.x86_64 (SLES15 SP3)
> > >
> > > Jun 14 14:09:16 h19 systemd-coredump[28788]: Process 28786
> > > (pacemaker-execd) of user 0 dumped core.
> > > Jun 14 14:09:16 h19 pacemaker-execd[7440]: warning:
> > > prm_xen_v04_monitor_600000[28786] terminated with signal:
> > > Segmentation fault
> >
> > This means that the child process forked to execute the resource
> > agent segfaulted, which is odd.
>
> Yes, it's odd, but isn't the cluster there precisely to protect us
> from odd situations? ;-)
>
> > Is the agent a compiled program? If not, it's possible the tiny
> > amount of pacemaker code that executes the agent is what
> > segfaulted. Do you have the actual core, and can you do a
> > backtrace?
> Believe me, it's just "odd":
>
>   Stack trace of thread 28786:
>   #0  0x00007f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
>   #1  0x00007f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
>   #2  0x00007f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
>   #3  0x00007f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
>   #4  0x00007f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
>   #5  0x00007f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
>   #6  0x00007f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
>   #7  0x0000558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
>   #8  0x00007f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
>   #9  0x0000558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)
>
> Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU
> plus some software bugs ;-)
From the above, it looks like the C library's fork() is what's
segfaulting, so yeah, probably not much to do about it ...

> Regards,
> Ulrich
>
> > > Jun 14 14:09:16 h19 pacemaker-controld[7443]: error: Result of
> > > monitor operation for prm_xen_v04 on h19: Error
> > > Jun 14 14:09:16 h19 pacemaker-controld[7443]: notice: Transition
> > > 9 action 107 (prm_xen_v04_monitor_600000 on h19): expected 'ok'
> > > but got 'error'
> > > ...
> > > Jun 14 14:09:16 h19 pacemaker-schedulerd[7442]: notice:  *
> > > Recover    prm_xen_v04    ( h19 )
> > >
> > > Regards,
> > > ulrich
> > >
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
> >
> > --
> > Ken Gaillot <[email protected]>

--
Ken Gaillot <[email protected]>
