> If you mean dlm/clvmd_waiters, it's empty on all nodes. Is there
> anything else to check?
I guess that might be the wrong thing to look at when it's recovery that's blocked; my memory about this isn't great. I think the clues to check for recovery are mainly the dlm kernel messages and maybe these files (foo being the lockspace name, clvmd here):

  /sys/kernel/dlm/foo/recover_status (flags may indicate which message is being waited for)
  /sys/kernel/dlm/foo/recover_nodeid (which node a reply is needed from)

To rule out userspace dlm_controld problems, look at the dlm_controld debug logs on each node and line up these steps from each of them:

  clvmd check_ringid cluster 3724 (ringid needs to match)
  clvmd start_kernel cg <N> member_count 6 (<N> will be different)
  write "1" to "/sys/kernel/dlm/clvmd/control"
  write "0" to "/sys/kernel/dlm/clvmd/event_done"

After this, follow the dlm kernel recovery messages, lining up the same steps in parallel from each node. The point at which they stop is the recovery stage where a message didn't get through. You can probably work out which message, and between which nodes, from the sysfs files above.

_______________________________________________
Users mailing list
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
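The recover_status and recover_nodeid sysfs files mentioned above can be read on each node with a small script. The following is a minimal sketch that makes several assumptions. It assumes recover_status holds a hex bitmask and recover_nodeid holds a decimal node id. The flag names and bit values are taken from the DLM_RS_* defines in the kernel's fs/dlm/dlm_internal.h as I remember them, so check them against your kernel source. The function names are hypothetical helpers, not part of any dlm tool.

```python
from __future__ import annotations

from pathlib import Path

# ASSUMPTION: bit values mirror the DLM_RS_* defines in fs/dlm/dlm_internal.h
# (from memory); verify them against the kernel you are actually running.
DLM_RS_FLAGS = [
    (0x01, "DLM_RS_NODES"),
    (0x02, "DLM_RS_NODES_ALL"),
    (0x04, "DLM_RS_DIR"),
    (0x08, "DLM_RS_DIR_ALL"),
    (0x10, "DLM_RS_LOCKS"),
    (0x20, "DLM_RS_LOCKS_ALL"),
    (0x40, "DLM_RS_DONE"),
    (0x80, "DLM_RS_DONE_ALL"),
]


def decode_recover_status(text: str) -> list[str]:
    """Turn the hex text read from recover_status into the names of set flags."""
    value = int(text.strip(), 16)
    return [name for bit, name in DLM_RS_FLAGS if value & bit]


def read_recover_state(lockspace: str, root: str = "/sys/kernel/dlm") -> dict:
    """Read recover_status and recover_nodeid for one lockspace (hypothetical helper)."""
    base = Path(root) / lockspace
    status_text = (base / "recover_status").read_text()
    nodeid_text = (base / "recover_nodeid").read_text()
    return {
        "status_raw": status_text.strip(),
        "status_flags": decode_recover_status(status_text),
        # Per the advice above: the node a reply is needed from.
        "recover_nodeid": int(nodeid_text.strip()),
    }


if __name__ == "__main__":
    import sys

    name = sys.argv[1] if len(sys.argv) > 1 else "clvmd"
    try:
        print(name, read_recover_state(name))
    except OSError as err:
        # No dlm sysfs directory for this lockspace on this machine.
        print(f"cannot read dlm sysfs state for {name}: {err}")
```

Running it on every node while recovery is stuck shows which flags each node has reached. It also shows which node id each one is waiting on. A flag that is set on some nodes but missing on others is a hint about which recovery message didn't arrive.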

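The step-by-step comparison of the dlm_controld logs described above can also be partly automated. This is a minimal sketch under some assumptions. It assumes each node's debug log text has been saved to a string, for example from `dlm_tool dump` on that node. It also assumes the log lines contain the quoted step fragments verbatim, possibly after a timestamp. The helper names are hypothetical. For each node, the script reports the furthest of the four steps reached in its latest cycle, and it checks whether all nodes saw the same ringid.

```python
from __future__ import annotations

import re

# ASSUMPTION: dlm_controld debug lines contain these fragments verbatim
# (possibly after a timestamp); check them against your own logs.
RINGID_RE = re.compile(r"clvmd check_ringid cluster (\d+)")
LATER_STEPS = [
    ("start_kernel", re.compile(r"clvmd start_kernel cg \d+ member_count \d+")),
    ("control=1", re.compile(r'write "1" to "/sys/kernel/dlm/clvmd/control"')),
    ("event_done=0", re.compile(r'write "0" to "/sys/kernel/dlm/clvmd/event_done"')),
]


def last_step(log_text: str) -> tuple[str | None, str | None]:
    """Return (furthest step reached in the latest cycle, ringid of that cycle)."""
    reached = None
    ringid = None
    nxt = len(LATER_STEPS)  # no cycle seen yet, so later steps are ignored
    for line in log_text.splitlines():
        m = RINGID_RE.search(line)
        if m:
            # Each check_ringid line starts a new cycle; restart the sequence.
            ringid, reached, nxt = m.group(1), "check_ringid", 0
        elif nxt < len(LATER_STEPS) and LATER_STEPS[nxt][1].search(line):
            reached = LATER_STEPS[nxt][0]
            nxt += 1
    return reached, ringid


def compare_nodes(logs: dict[str, str]) -> dict[str, tuple[str | None, str | None]]:
    """Apply last_step to each node's log text, keyed by node name."""
    return {node: last_step(text) for node, text in logs.items()}


def ringids_match(results: dict[str, tuple[str | None, str | None]]) -> bool:
    """True when every node's latest cycle saw the same (non-missing) ringid."""
    ids = {ringid for _, ringid in results.values()}
    return len(ids) == 1 and None not in ids
```

A node whose step lags behind the others, or a ringid mismatch, marks where to start lining up the dlm kernel recovery messages. The script only checks the userspace side. The kernel messages still need to be compared by hand.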