On 09/11/2017 12:32 PM, Jan Friesse wrote:
> Ferenc,
>
>> [email protected] (Ferenc Wágner) writes:
>>
>>> Jan Friesse <[email protected]> writes:
>>>
>>>> [email protected] writes:
>>>>
>>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>>>> ramping up):
>>>>>
>>>>> vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new
>>>>> configuration.
>>>>> vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new
>>>>> configuration.
>>>>> vhbl07 corosync[3805]: [MAIN ] Corosync main process was not
>>>>> scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider
>>>>> token timeout increase.
>>>>
>>>> ^^^ This is main problem you have to solve. It usually means that
>>>> machine is too overloaded. It is happening quite often when corosync
>>>> is running inside VM where host machine is unable to schedule regular
>>>> VM running.
>>>
>>> After some extensive tracing, I think the problem lies elsewhere: my
>>> IPMI watchdog device is slow beyond imagination.

Just for my understanding: are you using the watchdog handling in corosync?

Klaus

>>
>> Confirmed: setting watchdog_device: off cluster wide got rid of the
>> above warnings.
>>
>
> Yep, good you found the issue. This is perfectly possible if ioctl
> blocks.
>
>>> Its ioctl operations can take seconds, starving all other functions.
>>> At least, it seems to block the main thread of Corosync. Is this a
>>> plausible scenario? Corosync has two threads, what are their roles?
>
> The first (main) thread does almost everything. There is a main loop
> (epoll) that I described in my previous mail.
>
> The second thread is created by libqb and is used only for logging.
> This prevents corosync from blocking when a syslog/file log write
> blocks for some reason. It means some messages may be lost, but that
> is still better than blocking.
>
> Back to the problem you have. It's definitely a HW issue, but I'm
> thinking about how to solve it in software. Right now, I can see two
> ways:
> 1. Set the watchdog FD to be non-blocking right at the end of
>    setup_watchdog. This is preferred, but I'm not sure it's really
>    going to work.
> 2. Create a thread which makes sure to tickle the wd regularly.
>
> Regards,
> Honza
>
> _______________________________________________
> Users mailing list: [email protected]
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
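
To illustrate the second option (a dedicated thread that tickles the watchdog), here is a minimal sketch in Python. It is a simulation, not corosync code: `SlowWatchdog`, `WatchdogThread`, and `max_loop_gap` are hypothetical names made up for this example, and the slow device is modelled with `time.sleep` rather than a real keepalive ioctl on the watchdog device. The only point is that a keepalive call which blocks for a long time no longer stalls the main loop once it runs in its own thread.

```python
import threading
import time


class SlowWatchdog:
    """Hypothetical stand-in for a watchdog whose keepalive call blocks."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.pets = 0

    def pet(self):
        time.sleep(self.delay_s)  # simulates a slow ioctl on the device
        self.pets += 1


class WatchdogThread:
    """Pets the watchdog from its own thread at a fixed interval."""

    def __init__(self, watchdog, interval_s):
        self._watchdog = watchdog
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns True once stop() is called, ending the loop.
        while not self._stop_event.wait(self._interval_s):
            self._watchdog.pet()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()


def max_loop_gap(iterations, tick_s):
    """Run a toy main loop and return the longest gap between iterations."""
    worst = 0.0
    last = time.monotonic()
    for _ in range(iterations):
        time.sleep(tick_s)  # stands in for one pass of the epoll loop
        now = time.monotonic()
        worst = max(worst, now - last)
        last = now
    return worst


if __name__ == "__main__":
    dog = SlowWatchdog(delay_s=0.3)
    petter = WatchdogThread(dog, interval_s=0.02)
    petter.start()
    gap = max_loop_gap(iterations=50, tick_s=0.01)
    petter.stop()
    print(f"pets: {dog.pets}, worst main-loop gap: {gap * 1000:.1f} ms")
```

With the keepalive in a separate thread, the worst main-loop gap should stay close to the loop tick rather than the device delay. Calling `dog.pet()` directly inside the loop would instead add the full device delay to every pass, which is the kind of stall behind the "not scheduled" warning quoted above. A real implementation would also have to decide what happens if the keepalive thread itself falls behind the hardware timeout.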
