On Fri, Nov 14, 2008 at 16:25, Raoul Bhatia [IPAX] <[EMAIL PROTECTED]> wrote:
> Dear list,
>
> I am again seeing my cluster malfunction.
>
> I upgraded my configuration to use pam-ldap and libnss-ldap, which
> did not work out of the box.
>
> Pacemaker tried to recover from some errors and then STONITHed both
> hosts.
>
> All I have now is:
>> Node: wc01 (31de4ab3-2d05-476e-8f9a-627ad6cd94ca): online
>> Node: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396): online
>>
>> Clone Set: clone_nfs-common
>>     Resource Group: group_nfs-common:0
>>         nfs-common:0        (lsb:nfs-common):       Started wc02
>>     Resource Group: group_nfs-common:1
>>         nfs-common:1        (lsb:nfs-common):       Started wc01
>> Clone Set: DoFencing
>>     stonith_rackpdu:0   (stonith:external/rackpdu):     Started wc02
>>     stonith_rackpdu:1   (stonith:external/rackpdu):     Started wc01
>
> and Pacemaker seems happy :)
>
> (Please note that I normally have several groups, clones, and
> master/slave resources active.)
>
> I took a look at pe-warn-12143.bz2 but do not know how to interpret the
> three different threads I see in the corresponding .dot file.
>
> Can anyone explain how I may debug such a deadlock?
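To your last question: you can replay any saved PE file through ptest and
regenerate the graph yourself. Something like the following should work
(a sketch from memory; the exact option names vary a little between
releases, so check ptest --help first):

  # Replay the saved Policy Engine input and dump the transition graph.
  # Most builds read the .bz2 directly; if yours does not, bunzip2 it first.
  ptest -VV --xml-file pe-warn-12143.bz2 --save-dotfile pe-warn-12143.dot

  # Render the graph with Graphviz for inspection
  dot -Tpng pe-warn-12143.dot -o pe-warn-12143.png

As for reading it: each node is an action the PE wants performed and the
arrows are ordering dependencies, so the three "threads" you see are just
three chains of actions with no ordering constraints between them. If I
remember the conventions correctly, actions drawn with a dashed border are
ones that cannot be executed.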
What exactly were you trying to determine here?

Running through ptest, I see two major areas of concern:

  Clones clone_nfs-common contains non-OCF resource nfs-common:0 and so
    can only be used as an anonymous clone. Set the globally-unique meta
    attribute to false
  Clones clone_mysql-proxy contains non-OCF resource mysql-proxy:0 and so
    can only be used as an anonymous clone. Set the globally-unique meta
    attribute to false

and

  Hard error - drbd_www:0_monitor_0 failed with rc=4: Preventing drbd_www:0 from re-starting on wc01
  Hard error - drbd_www:1_monitor_0 failed with rc=4: Preventing drbd_www:1 from re-starting on wc01
  Hard error - drbd_mysql:1_monitor_0 failed with rc=4: Preventing drbd_mysql:1 from re-starting on wc01
  Hard error - drbd_mysql:0_monitor_0 failed with rc=4: Preventing drbd_mysql:0 from re-starting on wc01

If DRBD is failing, I can imagine that would prevent much of the rest of
the cluster from being started. (rc=4 is OCF_ERR_PERM, i.e. insufficient
privileges, which the PE treats as a hard, non-recoverable error.)

Also, you might want to look into why these resources were already
running before the cluster started them:

  Operation nfs-kernel-server_monitor_0 found resource nfs-kernel-server active on wc01
  Operation nfs-common:0_monitor_0 found resource nfs-common:0 active on wc01

Having said all that, I just looked at the config, and all of the above
is more than likely caused by the issue we spoke about the other day:
loading 0.6 config fragments into a 1.0 cluster (where all the meta
attributes now have dashes instead of underscores).
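For illustration, this is the kind of change that implies (the ids are
shortened and the exact layout is a sketch from memory, so double-check
it against the 1.0 schema): on 1.0 the clone needs

  <clone id="clone_nfs-common">
    <meta_attributes id="clone_nfs-common-meta">
      <nvpair id="clone_nfs-common-gu" name="globally-unique" value="false"/>
    </meta_attributes>
    <!-- group_nfs-common definition goes here -->
  </clone>

whereas a 0.6-era fragment would have used name="globally_unique" (with
an underscore), which a 1.0 cluster silently ignores; hence the clone
warnings above.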
