On Thu, Aug 09, 2018 at 06:11:48PM +0200, Ferenc Wágner wrote:
> Hi David,
>
> Almost ten years ago you requested more info in a similar case, let's
> see if we can get further now!

Hi,

the usual cause is that a network message from the dlm has been
lost/dropped/missed. The dlm can't recover from that, which is clearly
a weak point in the design. There may be some new development coming
along to finally improve that.

One way you can confirm this is to check whether the dlm on one or more
nodes is waiting for a message that's not arriving. Often you'll see an
entry in the dlm "waiters" debugfs file corresponding to the response
that's being waited on.

Another red flag is kernel messages from a network driver indicating
some network hiccup at the time things hung. I can't say whether the
messages you sent happened at the right time, or whether they even
correspond to the interface the dlm is using, but it's worth checking
as a possible explanation:

[ 137.207059] be2net 0000:05:00.0 enp5s0f0: Link is Up
[ 137.252901] be2net 0000:05:00.1 enp5s0f1: Link is Up
[ 153.886619] connection2:0: detected conn error (1011)
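
As a rough illustration of the waiters check described above, here is a
small Python sketch that lists the non-empty "waiters" files on a node.
It assumes debugfs is mounted at /sys/kernel/debug, that the dlm exposes
one <lockspace>_waiters file per lockspace under /sys/kernel/debug/dlm,
and that it runs as root; the function name pending_waiters is made up
for this example. Lines are printed raw rather than parsed, since the
field layout may differ between kernel versions. A waiter that shows up
briefly is normal; one that is still there after several runs is the
interesting case.

```python
import os

# Assumed location: debugfs mounted at /sys/kernel/debug (usually root-only).
DLM_DEBUGFS_DIR = "/sys/kernel/debug/dlm"
SUFFIX = "_waiters"


def pending_waiters(debug_dir=DLM_DEBUGFS_DIR):
    """Return {lockspace: [raw lines]} for every non-empty *_waiters file.

    pending_waiters is a hypothetical helper for illustration, not part of
    any dlm tooling.
    """
    pending = {}
    for name in sorted(os.listdir(debug_dir)):
        if not name.endswith(SUFFIX):
            continue
        lockspace = name[: -len(SUFFIX)]
        with open(os.path.join(debug_dir, name), errors="replace") as f:
            # Keep each entry verbatim; only blank lines are dropped.
            lines = [line.rstrip("\n") for line in f if line.strip()]
        if lines:
            pending[lockspace] = lines
    return pending


if __name__ == "__main__":
    try:
        result = pending_waiters()
    except OSError as exc:
        # Missing directory or insufficient permissions.
        print(f"cannot read {DLM_DEBUGFS_DIR}: {exc}")
    else:
        for lockspace, lines in result.items():
            print(f"{lockspace}: {len(lines)} waiter(s)")
            for line in lines:
                print("  " + line)
```

Running it a few seconds apart on each node and comparing the output
shows which entries persist, and therefore which node is stuck waiting.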
