Re: [ClusterLabs] DLM recovery stuck

David Teigland Thu, 09 Aug 2018 12:30:28 -0700

> If you mean dlm/clvmd_waiters, it's empty on all nodes.  Is there
> anything else to check?


I guess that might be the wrong thing to look at when it's recovery that's
blocked; my memory about this isn't great.  I think the main clues to check
for recovery are the dlm kernel messages and maybe:

  /sys/kernel/dlm/foo/recover_status
  (flags may indicate which message is being waited for)

  /sys/kernel/dlm/foo/recover_nodeid
  (which node a reply is needed from)
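
A minimal sketch for collecting those two files on a node, assuming "foo"
above stands for the lockspace name (e.g. clvmd).  The function name and
the sysfs_root parameter are made up for illustration; the values are
returned raw, since how to decode the flags isn't covered here.

```python
from pathlib import Path


def read_recovery_state(lockspace, sysfs_root="/sys/kernel/dlm"):
    """Return the raw contents of recover_status and recover_nodeid.

    A value of None means the file was missing or unreadable, e.g. the
    lockspace does not exist on this node.
    """
    base = Path(sysfs_root) / lockspace
    state = {}
    for name in ("recover_status", "recover_nodeid"):
        try:
            state[name] = (base / name).read_text().strip()
        except OSError:
            state[name] = None
    return state


if __name__ == "__main__":
    # "clvmd" is the lockspace discussed in this thread; adjust as needed.
    for name, value in read_recovery_state("clvmd").items():
        print(f"{name}: {value}")
```

Running this on every node at about the same time gives a snapshot to
compare across nodes.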

To eliminate userspace dlm_controld problems, look at dlm_controld debug
logs on each node and line up these steps from each of them:

clvmd check_ringid cluster 3724               (ringid needs to match)
clvmd start_kernel cg <N> member_count 6      (<N> will be different)
write "1" to "/sys/kernel/dlm/clvmd/control"
write "0" to "/sys/kernel/dlm/clvmd/event_done"
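
Lining these up by hand across six nodes gets tedious, so here is a rough
sketch that pulls the four steps out of each node's debug log (for example
the output of "dlm_tool dump" saved to a file per node).  The regular
expressions are only modelled on the lines shown above, and the function
names are made up; real log lines may carry extra prefixes such as
timestamps, which the search-based matching tolerates.

```python
import re

# Patterns modelled on the four steps quoted above (an assumption about
# the exact log wording, not taken from the dlm_controld source).
STEP_PATTERNS = [
    ("check_ringid", re.compile(r"check_ringid cluster (\d+)")),
    ("start_kernel", re.compile(r"start_kernel cg (\d+) member_count (\d+)")),
    ("control", re.compile(r'write "1" to "/sys/kernel/dlm/[^"/]+/control"')),
    ("event_done", re.compile(r'write "0" to "/sys/kernel/dlm/[^"/]+/event_done"')),
]


def extract_steps(lines, lockspace="clvmd"):
    """Return [(step_name, captured_numbers), ...] in log order."""
    steps = []
    for line in lines:
        if lockspace not in line:
            continue  # skip lines about other lockspaces
        for name, pattern in STEP_PATTERNS:
            match = pattern.search(line)
            if match:
                steps.append((name, match.groups()))
                break
    return steps


def last_ringid(steps):
    """Return the ringid from the most recent check_ringid step, or None."""
    ringids = [groups[0] for name, groups in steps if name == "check_ringid"]
    return ringids[-1] if ringids else None


def ringids_by_node(logs_by_node, lockspace="clvmd"):
    """Map node name -> last ringid seen, so mismatches stand out."""
    return {
        node: last_ringid(extract_steps(lines, lockspace))
        for node, lines in logs_by_node.items()
    }
```

If ringids_by_node() returns more than one distinct value, or None for
some node, that node is the first place to look.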

After this, follow the dlm kernel recovery messages, lining up the same
steps in parallel from each node.  The point at which they stop is the
recovery stage where a message didn't get through.  You can probably work
out which message between which nodes based on the sysfs files above.
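
To spot where each node stopped, something like the following could be
run over the kernel log saved from each node.  It assumes the lockspace
messages contain "dlm: <lockspace>:" somewhere in the line; that prefix
and the function names are assumptions for this sketch, so adjust the
marker to whatever your kernel log actually shows.

```python
def last_recovery_message(kernel_lines, lockspace="clvmd"):
    """Return the text of the last kernel message for this lockspace.

    Returns None if no line contains the assumed marker.
    """
    marker = f"dlm: {lockspace}:"
    last = None
    for line in kernel_lines:
        idx = line.find(marker)
        if idx != -1:
            # Keep only the message text after the marker.
            last = line[idx + len(marker):].strip()
    return last


def stall_report(logs_by_node, lockspace="clvmd"):
    """Map node name -> last lockspace message seen on that node."""
    return {
        node: last_recovery_message(lines, lockspace)
        for node, lines in logs_by_node.items()
    }
```

Nodes whose last message is an earlier step than the others are the ones
to compare against recover_status and recover_nodeid.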

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
