On 02/19/2019 06:21 PM, Edwin Török wrote:
>
> On 19/02/2019 17:02, Klaus Wenninger wrote:
>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>> I was able to finally reproduce this using only upstream components
>>>> (although it seems to be easier to reproduce if we use our patched SBD,
>>>> I was able to reproduce this by using only upstream packages unpatched
>>>> by us):
>> Just out of curiosity: What did you patch in SBD?
>> Sorry if I missed the answer in the previous communication.
> It is mostly this PR, which calls getquorate quite often (a more
> efficient impl. would be to use the quorum notification API like
> dlm/pacemaker do, although see concerns in
> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
> https://github.com/ClusterLabs/sbd/pull/27

Ooh yes, totally forgotten about that ... bad conscience ...
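For reference, the notification-based variant would look roughly like the
sketch below. This is only a minimal illustration of the libquorum API
(corosync/quorum.h) as far as I remember it, not the actual sbd or
dlm/pacemaker code, and the callback body is made up:

    /* Minimal sketch: have quorum changes pushed via the corosync quorum
     * notification API instead of polling quorum_getquorate() in a loop. */
    #include <stdio.h>
    #include <corosync/corotypes.h>
    #include <corosync/quorum.h>

    static void quorum_notify(quorum_handle_t h, uint32_t quorate,
                              uint64_t ring_seq, uint32_t view_entries,
                              uint32_t *view_list)
    {
        /* Called from quorum_dispatch() whenever the quorum state changes. */
        printf("quorate=%u ring=%llu members=%u\n",
               quorate, (unsigned long long)ring_seq, view_entries);
    }

    int main(void)
    {
        quorum_handle_t handle;
        uint32_t quorum_type;
        quorum_callbacks_t cb = { .quorum_notify_fn = quorum_notify };

        if (quorum_initialize(&handle, &cb, &quorum_type) != CS_OK ||
            quorum_trackstart(handle, CS_TRACK_CHANGES) != CS_OK)
            return 1;

        /* A real servant would put the fd from quorum_fd_get() into its own
         * poll loop; blocking dispatch is enough to show the idea. */
        quorum_dispatch(handle, CS_DISPATCH_BLOCKING);

        quorum_trackstop(handle);
        quorum_finalize(handle);
        return 0;
    }

The concerns from the thread you linked apply either way, of course.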
>
> We have also added our own servant for watching the health of our
> control plane, but that is not relevant to this bug (it reproduces with
> that watcher turned off too).
>
>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>> that showed something interesting:
>>> https://clbin.com/d76Ha
>>>
>>> It is looping on:
>>> debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>> (non-critical): Resource temporarily unavailable (11)
>> Hmm ... something like tx-queue of the device full, or no buffers
>> available anymore and kernel-thread doing the cleanup isn't
>> scheduled ...
> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
> (although this seems different from the original bug where it got stuck
> in epoll_wait)
>
>> Does the kernel log anything in that situation?
> Other than the crmd segfault no.
> From previous observations on xenserver the softirqs were all stuck on
> the CPU that corosync hogged 100% (I'll check this on upstream, but I'm
> fairly sure it'll be the same). softirqs do not run at realtime priority
> (if we increase the priority of ksoftirqd to realtime then it all gets
> unstuck), but seem to be essential for whatever corosync is stuck
> waiting on, in this case likely the sending/receiving of network packets.
>
> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
> why this was only reproducible on 4.19 so far.

Maybe an issue of that kernel with distributing the load over cores ...
Can you provoke it by trying on a single-core or doing some pinning of
softirqd and corosync to the same core?
Just unfortunate that this is the LTS ...

> Best regards,
> --Edwin
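Regarding the sendmsg()/EAGAIN loop from the blackbox: what you describe
(going back to the poll loop instead of retrying) could look roughly like
the sketch below. Not corosync's actual totem transmit path, just an
illustration; the epoll fd and the non-blocking socket are assumed to come
from the existing event loop:

    #include <errno.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    /* Returns 0 if sent, 1 if deferred until the socket is writable again,
     * -1 on a real error. */
    static int send_or_defer(int epfd, int sock, const struct msghdr *msg)
    {
        if (sendmsg(sock, msg, MSG_NOSIGNAL) >= 0)
            return 0;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* tx queue full: arm EPOLLOUT and go back to epoll_wait() instead
         * of spinning on sendmsg() at realtime priority. */
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLOUT,
            .data.fd = sock,
        };
        epoll_ctl(epfd, EPOLL_CTL_MOD, sock, &ev);
        return 1;                /* retry when EPOLLOUT fires */
    }

Blocking in epoll_wait() yields the CPU, so ksoftirqd could run and drain
the queues even without being at realtime priority itself.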

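For the affinity experiment: a throwaway helper like the one below
(essentially what taskset -pc does) would let you pin corosync onto a chosen
CPU so it competes directly with that CPU's ksoftirqd thread; names and
usage are of course just illustrative:

    /* Pin the given PIDs onto a single CPU, e.g. to force corosync onto the
     * CPU that ksoftirqd/0 is bound to and try to provoke the starvation. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <cpu> <pid>...\n", argv[0]);
            return 1;
        }

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(atoi(argv[1]), &set);

        for (int i = 2; i < argc; i++) {
            if (sched_setaffinity((pid_t)atoi(argv[i]), sizeof(set), &set) != 0)
                perror("sched_setaffinity");
        }
        return 0;
    }

Run as root, e.g. "./pin 0 $(pidof corosync)", and see whether the
sendmsg/EAGAIN spin shows up more readily.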