On 06/02/2014 05:52, Vladislav Bogdanov wrote:
Hi,

I bet your problem comes from the LSB clvmd init script. Here is what it does:

===========
...
clustered_vgs() {
        ${lvm_vgdisplay} 2>/dev/null | \
            awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
}

clustered_active_lvs() {
        for i in $(clustered_vgs); do
                ${lvm_lvdisplay} $i 2>/dev/null | \
                    awk 'BEGIN {RS="LV Name"} {if (/[^N^O^T] available/) print $1;}'
        done
}

rh_status() {
        status $DAEMON
}
...
case "$1" in
...
  status)
        rh_status
        rtrn=$?
        if [ $rtrn = 0 ]; then
                cvgs="$(clustered_vgs)"
                echo Clustered Volume Groups: ${cvgs:-"(none)"}
                clvs="$(clustered_active_lvs)"
                echo Active clustered Logical Volumes: ${clvs:-"(none)"}
        fi
...
esac
exit $rtrn
=========

So it not only checks the status of the daemon itself, it also tries to list clustered volume groups. That operation is blocked because fencing is still in progress, and the whole cLVM stack (as well as DLM itself and all other dependent services) is frozen. Your resource's monitor operation therefore times out, and pacemaker then asks it to stop (unless you have on-fail=fence). Anyway, there is a big chance that the stop will fail too, and that leads to fencing again.

cLVM is very fragile in my opinion (although newer versions running on the corosync2 stack seem to be much better). It probably still doesn't work well when managed by pacemaker in CMAN-based clusters, because it blocks globally if any node in the whole cluster is online at the cman layer but does not run clvmd (I last checked with .99). And that was the same for all stacks, until it was fixed for the corosync (only 2?) stack recently. The problem with that is that you cannot just stop pacemaker on one node (f.e. for maintenance); you should immediately stop cman as well (or run clvmd in the cman'ish way), otherwise cLVM freezes on another node. This should be easily fixable in the clvmd code, but nobody cares.
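The awk filter from the init script above can be exercised against canned vgdisplay-style output, which makes the failure mode clearer: `status` only succeeds if this pipeline can actually read VG metadata. The sample text below is hypothetical and abridged, not real vgdisplay output:

```shell
# Hypothetical, abridged vgdisplay-style output: two VGs, one clustered.
sample_vgdisplay() {
  cat <<'EOF'
  --- Volume group ---
  VG Name               vg_local
  Format                lvm2
  --- Volume group ---
  VG Name               vg_shared
  Clustered             yes
  Format                lvm2
EOF
}

# The init script's filter: split records on "VG Name" and print the first
# field of any record that mentions "Clustered" (i.e. the VG's name).
sample_vgdisplay | awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
# with GNU awk (multi-character RS is treated as a regex), prints: vg_shared
```

Note that a multi-character RS is a GNU awk extension; POSIX leaves it unspecified, so the init script implicitly assumes gawk.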
Thanks for the explanation, this is interesting for me as I need a volume manager in the cluster to manage the shared file systems in case I need to resize them for some reason.

I think I may be running into something similar now that I am testing cman outside of the cluster. Even though I have cman/clvmd enabled outside pacemaker, the clvmd daemon still hangs, even after the 2nd node has been rebooted due to a fence operation. When it (node 2) reboots, cman and clvmd start, and I can see both nodes as members using cman_tool, but clvmd still seems to have an issue: it just hangs. I can't see off-hand whether dlm still thinks pacemaker is in the fence operation (or whether it has already returned true for a successful fence). I am still gathering logs and will post back to this thread once I have all my logs from yesterday and this morning.
I don't suppose there is another cluster-aware volume manager that anyone is aware of?
Increasing the timeout for the LSB clvmd resource probably won't help you, because LVM operations blocked by DLM waiting on fencing IIRC never finish. You may want to search for the clvmd OCF resource agent; it is available for SUSE, I think. Although it is not perfect, it should work much better for you.
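Since a blocked LVM query never returns, the only robust monitor strategy is to bound the query itself rather than rely on pacemaker's operation timeout. A minimal sketch of the idea (the function name is mine, and `sleep` stands in for a blocked `vgdisplay`; a real agent would wrap the actual LVM command):

```shell
# Sketch: run a possibly-blocking command with a hard time budget.
# coreutils `timeout` returns exit status 124 when it had to kill the command.
bounded_lvm_query() {
  timeout 2 "$@"
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "LVM query did not return within 2s (DLM blocked on fencing?)" >&2
  fi
  return "$rc"
}

bounded_lvm_query sleep 1 && echo "query returned"    # fast path: succeeds
bounded_lvm_query sleep 10 || echo "query timed out"  # blocked path: fails after 2s
```

With this, a monitor action can report failure quickly and deterministically instead of hanging until pacemaker gives up and escalates to stop.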
I will have a look around for this clvmd OCF agent and see what is involved in getting it to work on CentOS 6.5, if I don't have any success with the current recommendation of running it outside of pacemaker's control.
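For reference, on SUSE-style setups the clvmd OCF agent is normally cloned across the cluster and grouped after the DLM controld agent, so clvmd never starts without its lock manager. A hypothetical crm shell fragment; the agent names `ocf:pacemaker:controld` and `ocf:lvm2:clvmd` are what I would expect, but verify what your packages actually ship:

```
primitive p_dlm ocf:pacemaker:controld \
        op monitor interval="60s" timeout="60s"
primitive p_clvmd ocf:lvm2:clvmd \
        op monitor interval="60s" timeout="60s"
group g_lvm p_dlm p_clvmd
clone cl_lvm g_lvm \
        meta interleave="true"
```

The `interleave="true"` meta attribute matters here: it lets each node's clvmd depend only on its local DLM instance rather than on every node's.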
_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
