On 14/02/2019 17:33, Edwin Török wrote:
> Hello,
>
> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
> noticed a fundamental problem with realtime priorities:
> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
>   also routed to CPU3
> - corosync runs with SCHED_RR; ksoftirqd does not (should it?), but
>   without it packets sent/received on that interface would not get
>   processed
> - corosync is in a busy loop using 100% CPU, never giving softirqs
>   (including TIMER and SCHED) a chance to be processed
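To make the scenario concrete, here is a minimal sketch of a task set up the way corosync is described above: pinned to CPU3, given SCHED_RR, and busy-looping. The CPU number and the priority value are illustrative only; this is not corosync's actual code.

/* Sketch only: a SCHED_RR busy loop pinned to the CPU that also
 * handles the NIC interrupts. Build: gcc -o rr_hog rr_hog.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                       /* run only on CPU3 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    struct sched_param sp = { .sched_priority = 99 };  /* hypothetical value */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");       /* needs root/CAP_SYS_NICE */
        return EXIT_FAILURE;
    }

    for (;;)
        ;                                   /* hog the CPU */
}

With the default throttling settings shown under [3] below, such a task should in principle still be limited to 95% of each second.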
Can you tell me what distribution this is, please? There are patches to
libqb that should be applied to fix a similar problem in 1.0.1-6, but
that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
need to be sure that those fixes are in your libqb before going any
further.

Without doubt this is a bug; in normal operation corosync is quite light
on CPU.

Chrissie

> This appears to be a priority inversion problem: if corosync runs as
> realtime, then everything it needs (timers...) should be realtime as
> well. Otherwise, running as realtime guarantees we'll miss the watchdog
> deadline, instead of guaranteeing that we process the data before the
> deadline.
>
> Do you have some advice on what the expected realtime priorities would
> be for:
> - corosync
> - sbd
> - hard irqs
> - soft irqs
>
> Also, would it be possible for corosync to avoid hogging the CPU in libqb?
> (Our hypothesis is that if softirqs are not processed, then timers
> wouldn't work for processes on that CPU either.)
>
> Some more background and analysis:
>
> We noticed that cluster membership changes were very unstable
> (especially as you approach clusters of 16 physical hosts), causing the
> whole cluster to fence when adding a single node. This was previously
> working fairly reliably with a 4.4-based kernel.
>
> I've increased the SBD timeout to 3600 to be able to investigate the
> problem, and noticed that corosync was using 100% of CPU3 [1]. I
> immediately lost SSH access on eth0 (corosync was using eth1), as
> eth0's interrupts were also processed on CPU3 (and irqbalance didn't
> move them).
>
> IIUC, SCHED_RR tasks should not be able to take up 100% of a CPU:
> according to [3], they shouldn't be allowed to use more than 95%.
>
> Softirqs were not processed at all on CPU3 (see [2]; the numbers in the
> CPU3 column did not change, while the numbers in the other columns did).
> I tried decreasing the priority of corosync to 1 using chrt, which
> didn't help. I then increased the priority of ksoftirqd to 50 using
> chrt, which immediately solved the CPU usage problem in corosync.
>
> I tried a simple infinite-loop program with realtime priority, but it
> didn't reproduce the problems with interrupts getting stuck.
>
> Three problems here:
> * all softirqs were stuck (not being processed) on CPU3, including
>   TIMER and SCHED. corosync relies quite heavily on timers; would the
>   lack of processing them cause the 100% CPU usage?
> * is there a kernel bug introduced between 4.4 and 4.19 that causes
>   realtime tasks to no longer respect the 95% limit? That limit would
>   leave 5% of CPU time for IRQs, including NIC IRQs.
> * if corosync runs at a higher priority than the kernel softirq thread
>   processing NIC IRQs, how does it expect incoming packets to be
>   processed while it hogs the CPU receiving, polling and sending
>   packets?
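The chrt workaround above boils down to a sched_setscheduler() call on the ksoftirqd thread. A minimal equivalent sketch (the PID is passed on the command line, e.g. that of ksoftirqd/3; chrt's default policy is SCHED_RR):

/* Sketch of the workaround: raise a given thread to SCHED_RR priority
 * 50, roughly what `chrt -p 50 <pid>` does. Build: gcc -o rt_boost rt_boost.c */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return EXIT_FAILURE;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(pid, SCHED_RR, &sp) != 0) {  /* needs root */
        perror("sched_setscheduler");
        return EXIT_FAILURE;
    }
    printf("pid %ld is now SCHED_RR priority 50\n", (long)pid);
    return 0;
}

Presumably this helps because ksoftirqd/3 then competes for the CPU as a realtime task instead of being starved as an ordinary SCHED_OTHER one.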
>
> On another host which exhibited the same problem I've run strace (which
> also had the side-effect of getting corosync unstuck from 100% CPU use
> after strace finished):
>     1 bind
>     5 close
>   688 epoll_wait
>     8 futex
>     1 getsockname
>     3 ioctl
>     1 open
>     3 recvfrom
>   190 recvmsg
>    87 sendmsg
>     9 sendto
>     4 socket
>     6 write
>
> On yet another host I've run gdb while corosync was stuck:
> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
> #0  0x00007f6fd34a0afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
> #1  0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
> #2  0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
> #3  0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
> #4  0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
> #5  0x00007f6fd31c4bad in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
> #0  0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
> #1  0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
> #2  0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
> #3  0x000055592d62ff78 in main ()
>
> [1]
> top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
> Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem :  4210184 total,  3374752 free,   387380 used,   448052 buff/cache
> KiB Swap:  1048572 total,  1048572 free,        0 used.  3660276 avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 16217 root      rt   0  194832  92532  66128 R 104.0  2.2   2:54.00 corosync
> 24004 root      20   0  161984   4432   3696 R   4.0  0.1   0:00.02 top
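For what it's worth, the thread 1 backtrace together with the 688 epoll_wait calls counted by strace suggests the main loop is returning from epoll_wait over and over. One way a loop that appears to block in epoll_wait can still burn 100% CPU is a zero (or always already-expired) timeout, which would fit the timer hypothesis above. A contrived illustration of that failure mode, not libqb's actual code:

/* Illustration only: an event loop whose timeout is always 0 never
 * sleeps, so it spins at 100% CPU even with no work to do, and strace
 * shows a flood of epoll_wait calls. Build: gcc -o spin spin.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return EXIT_FAILURE;
    }

    struct epoll_event events[16];
    for (;;) {
        /* timeout 0 means "poll and return immediately"; with no ready
         * fds this loop burns a full CPU */
        int n = epoll_wait(epfd, events, 16, 0);
        if (n < 0) {
            perror("epoll_wait");
            return EXIT_FAILURE;
        }
        /* dispatch the n ready events here ... */
    }
}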
> cat /proc/interrupts | grep eth
>  88:      2       0       0       0       0       0       0       0  xen-pirq  -msi-x  eth0
>  89:   2805       0       0  419621       0       0       0       0  xen-pirq  -msi-x  eth0-TxRx-0
>  90:      2       0       0       0       0       0       1       0  xen-pirq  -msi-x  eth1
>  91:    435  171086       0       0       0       0       0       0  xen-pirq  -msi-x  eth1-TxRx-0
>
> [2]
> watch cat /proc/softirqs
>                CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
>       HI:         1       0       0       0       0       0       0       0
>    TIMER:    355339  786951  367148  266583  389591  357184  369577  374880
>   NET_TX:         8    1081       1     743       1       1     223       5
>   NET_RX:      5807  230818    2709  529755    3323    2598    2550    1878
>    BLOCK:      8166      53      42      13      10    6563      21    6894
> IRQ_POLL:         0       0       0       0       0       0       0       0
>  TASKLET:       103       6       9      85       0       0       0       0
>    SCHED:    240428  100842   83166   51213   77824   77270   72074   72261
>  HRTIMER:         0       0       0       0       0       0       0       0
>      RCU:    135491  245345  139513  130591  140280  135094  141007  136063
>
> [3]
> /proc/sys/kernel/sched_rt_runtime_us
> 950000
> /proc/sys/kernel/sched_rt_period_us
> 1000000
>
> Best regards,
> --Edwin
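The two files under [3] are the standard RT throttling sysctls: realtime tasks collectively may run for sched_rt_runtime_us out of every sched_rt_period_us, here 950000/1000000 = 95%, which is where the 95% figure above comes from. A small sketch that reads them and prints the budget:

/* Read the RT throttling sysctls and print the resulting budget.
 * A runtime of -1 means throttling is disabled entirely.
 * Build: gcc -o rt_budget rt_budget.c */
#include <stdio.h>
#include <stdlib.h>

static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1) ? 0 : -1;
    fclose(f);
    return ok;
}

int main(void)
{
    long runtime, period;
    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0 ||
        period <= 0) {
        fprintf(stderr, "could not read RT throttling sysctls\n");
        return EXIT_FAILURE;
    }
    if (runtime < 0)
        printf("RT throttling disabled (sched_rt_runtime_us = %ld)\n", runtime);
    else
        printf("RT tasks may use %ld of every %ld us: %.1f%%\n",
               runtime, period, 100.0 * (double)runtime / (double)period);
    return 0;
}

With the values quoted above this prints a 95.0% budget, so the observation that corosync held the CPU at 100% does look like the throttle failing to kick in (or being bypassed) rather than working as configured.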
