On 06/02/2014 05:52, Vladislav Bogdanov wrote:
Hi,

I bet your problem comes from the LSB clvmd init script. Here is what it does:

===========
...
clustered_vgs() {
        ${lvm_vgdisplay} 2>/dev/null | \
            awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
}

clustered_active_lvs() {
        for i in $(clustered_vgs); do
                ${lvm_lvdisplay} $i 2>/dev/null | \
                    awk 'BEGIN {RS="LV Name"} {if (/[^N^O^T] available/) print $1;}'
        done
}

rh_status() {
        status $DAEMON
}
...
case "$1" in
...
  status)
        rh_status
        rtrn=$?
        if [ $rtrn = 0 ]; then
                cvgs="$(clustered_vgs)"
                echo Clustered Volume Groups: ${cvgs:-"(none)"}
                clvs="$(clustered_active_lvs)"
                echo Active clustered Logical Volumes: ${clvs:-"(none)"}
        fi
...
esac
exit $rtrn
=========

So it not only checks the status of the daemon itself, it also tries to list clustered volume groups. That operation is blocked because fencing is still in progress, and the whole cLVM stack (as well as DLM itself and all other dependent services) is frozen. Your resource's monitor operation therefore times out, and pacemaker then asks it to stop (unless you have on-fail=fence). Anyway, there is a big chance that the stop will fail too, and that leads to fencing again.

cLVM is very fragile in my opinion (although newer versions running on the corosync2 stack seem to be much better). It probably still doesn't work well when managed by pacemaker in CMAN-based clusters, because it blocks globally if any node in the whole cluster is online at the cman layer but does not run clvmd (I last checked with .99). And that was the same for all stacks, until it was fixed for the corosync (only 2?) stack recently. The problem with that is that you cannot just stop pacemaker on one node (f.e. for maintenance); you should immediately stop cman as well (or run clvmd in the cman'ish way), otherwise cLVM freezes on another node. This should be easily fixable in the clvmd code, but nobody cares.
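The awk filter from the init script above can be exercised against canned vgdisplay-style output, which makes the failure mode clearer: `status` only succeeds if this pipeline can actually read VG metadata. The sample text below is hypothetical and abridged, not real vgdisplay output:

```shell
# Hypothetical, abridged vgdisplay-style output: two VGs, one clustered.
sample_vgdisplay() {
  cat <<'EOF'
  --- Volume group ---
  VG Name               vg_local
  Format                lvm2
  --- Volume group ---
  VG Name               vg_shared
  Clustered             yes
  Format                lvm2
EOF
}

# The init script's filter: split records on "VG Name" and print the first
# field of any record that mentions "Clustered" (i.e. the VG's name).
sample_vgdisplay | awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
# with GNU awk (multi-character RS is treated as a regex), prints: vg_shared
```

Note that a multi-character RS is a GNU awk extension; POSIX leaves it unspecified, so the init script implicitly assumes gawk.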
Thanks for the explanation, this is interesting for me as I need a volume manager in the cluster to manage the shared file systems in case I need to resize them for some reason.

I think I may be running into something similar now that I am testing cman outside of the cluster. Even though I have cman/clvmd enabled outside pacemaker, the clvmd daemon still hangs, even after the 2nd node has been rebooted due to a fence operation. When it (node 2) reboots, cman and clvmd start, and I can see both nodes as members using cman_tool, but clvmd still seems to have an issue: it just hangs. I can't see off-hand whether dlm still thinks pacemaker is in the fence operation (or whether it has already returned true for a successful fence). I am still gathering logs and will post back to this thread once I have all my logs from yesterday and this morning.
I don't suppose there is another cluster-aware volume manager that anyone is aware of?
Increasing the timeout for the LSB clvmd resource probably won't help you, because LVM operations blocked by DLM waiting on fencing IIRC never finish. You may want to search for the clvmd OCF resource agent; it is available for SUSE, I think. Although it is not perfect, it should work much better for you.
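Since a blocked LVM query never returns, the only robust monitor strategy is to bound the query itself rather than rely on pacemaker's operation timeout. A minimal sketch of the idea (the function name is mine, and `sleep` stands in for a blocked `vgdisplay`; a real agent would wrap the actual LVM command):

```shell
# Sketch: run a possibly-blocking command with a hard time budget.
# coreutils `timeout` returns exit status 124 when it had to kill the command.
bounded_lvm_query() {
  timeout 2 "$@"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "LVM query did not return within 2s (DLM blocked on fencing?)" >&2
  fi
  return "$rc"
}

bounded_lvm_query sleep 1 && echo "query returned"    # fast path: succeeds
bounded_lvm_query sleep 10 || echo "query timed out"  # blocked path: fails after 2s
```

With this, a monitor action can report failure quickly and deterministically instead of hanging until pacemaker gives up and escalates to stop.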
I will have a look around for this clvmd OCF agent and see what is involved in getting it to work on CentOS 6.5, if I don't have any success with the current recommendation of running it outside of pacemaker's control.
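For reference, on SUSE-style setups the clvmd OCF agent is normally cloned across the cluster and grouped after the DLM controld agent, so clvmd never starts without its lock manager. A hypothetical crm shell fragment; the agent names `ocf:pacemaker:controld` and `ocf:lvm2:clvmd` are what I would expect, but verify what your packages actually ship:

```
primitive p_dlm ocf:pacemaker:controld \
        op monitor interval="60s" timeout="60s"
primitive p_clvmd ocf:lvm2:clvmd \
        op monitor interval="60s" timeout="60s"
group g_lvm p_dlm p_clvmd
clone cl_lvm g_lvm \
        meta interleave="true"
```

The `interleave="true"` meta attribute matters here: it lets each node's clvmd depend only on its local DLM instance rather than on every node's.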
_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
