On Tue, 2022-06-14 at 15:53 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <[email protected]> wrote on 14.06.2022 at
> > > > 15:49 in message
> <[email protected]>:
> > On Tue, 2022-06-14 at 14:36 +0200, Ulrich Windl wrote:
> > > Hi!
> > >
> > > I had a case where a VirtualDomain monitor operation ended in a
> > > core dump (actually it was pacemaker-execd, but it counted as a
> > > "monitor" operation), and the cluster decided to restart the VM.
> > > Wouldn't it be worth retrying the monitor operation first?
> >
> > It counts like any other monitor failure.
> >
> > > Chances are that a retried monitor operation returns a better
> > > status than a segmentation fault.
> > > Or does the logic just ignore processes dying on signals?
> > >
> > > 20201202.ba59be712-150300.4.21.1.x86_64 (SLES15 SP3)
> > >
> > > Jun 14 14:09:16 h19 systemd-coredump[28788]: Process 28786
> > > (pacemaker-execd) of user 0 dumped core.
> > > Jun 14 14:09:16 h19 pacemaker-execd[7440]: warning:
> > > prm_xen_v04_monitor_600000[28786] terminated with signal:
> > > Segmentation fault
> >
> > This means that the child process forked to execute the resource
> > agent segfaulted, which is odd.
>
> Yes, it's odd, but isn't the cluster there precisely to protect us
> from odd situations? ;-)
>
> > Is the agent a compiled program? If not, it's possible the tiny
> > amount of pacemaker code that executes the agent is what
> > segfaulted. Do you have the actual core, and can you do a
> > backtrace?
> Believe me, it's just "odd":
>
>   Stack trace of thread 28786:
>   #0  0x00007f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
>   #1  0x00007f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
>   #2  0x00007f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
>   #3  0x00007f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
>   #4  0x00007f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
>   #5  0x00007f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
>   #6  0x00007f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
>   #7  0x0000558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
>   #8  0x00007f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
>   #9  0x0000558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)
>
> Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU
> plus some software bugs ;-)
From the above, it looks like the C library's fork() is what's
segfaulting, so yeah, probably not much to do about it ...

> Regards,
> Ulrich
>
> > > Jun 14 14:09:16 h19 pacemaker-controld[7443]: error: Result of
> > > monitor operation for prm_xen_v04 on h19: Error
> > > Jun 14 14:09:16 h19 pacemaker-controld[7443]: notice: Transition
> > > 9 action 107 (prm_xen_v04_monitor_600000 on h19): expected 'ok'
> > > but got 'error'
> > > ...
> > > Jun 14 14:09:16 h19 pacemaker-schedulerd[7442]: notice:  *
> > > Recover    prm_xen_v04    ( h19 )
> > >
> > > Regards,
> > > ulrich
> > >
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
> >
> > --
> > Ken Gaillot <[email protected]>

--
Ken Gaillot <[email protected]>
