Hi,
I'm setting up a cluster with crm over heartbeat and I keep running into trouble with resources being called on nodes that don't have them. The setup is pretty simple: we have 4 nodes, two physical servers and two virtual servers (Xen), in an asymmetric cluster. The Xen servers have to run DRBD (primary/secondary), an iSCSI target and a third daemon. (The physical servers don't run anything yet, but will have to mount stuff and start more Xen guests later on. That's why they are in the cluster.)
This is the CIB XML, pretty self-explanatory I guess:

<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <resources>
      <master_slave id="ms-san">
        <meta_attributes id="ma-ms-san">
          <attributes>
            <nvpair id="ma-ms-san-1" name="clone_max" value="2"/>
            <nvpair id="ma-ms-san-2" name="clone_node_max" value="1"/>
            <nvpair id="ma-ms-san-3" name="master_max" value="1"/>
            <nvpair id="ma-ms-san-4" name="master_node_max" value="1"/>
            <nvpair id="ma-ms-san-5" name="notify" value="yes"/>
            <nvpair id="ma-ms-san-6" name="globally_unique" value="false"/>
          </attributes>
        </meta_attributes>
        <primitive id="drbd-san" class="ocf" provider="heartbeat" type="drbd">
          <instance_attributes id="9002a0e4-28d2-4ca7-83d8-74cd7ac066e8">
            <attributes>
              <nvpair name="drbd_resource" value="san" id="12d7d833-facc-4ac3-b296-e5cc59dcb4d4"/>
            </attributes>
          </instance_attributes>
          <operations>
            <op name="monitor" interval="29s" timeout="10s" role="Master" id="714ea049-f14d-4b09-b856-8b374252e1de"/>
            <op name="monitor" interval="30s" timeout="10s" role="Slave" id="6c7ce46c-7fe5-4d22-8a31-eae6b2927711"/>
          </operations>
        </primitive>
      </master_slave>
      <group id="iscsi-cluster">
        <primitive class="ocf" provider="heartbeat" type="IPaddr2" id="iscsi-target-ip">
          <instance_attributes id="ia-iscsi-target-ip">
            <attributes>
              <nvpair id="ia-iscsi-target-ip-1" name="ip" value="10.0.3.5"/>
              <nvpair id="ia-iscsi-target-ip-2" name="nic" value="eth0"/>
            </attributes>
          </instance_attributes>
          <operations>
            <op id="iscsi-target-ip-monitor-0" name="monitor" interval="20s" timeout="10s"/>
          </operations>
        </primitive>
        <primitive id="iscsi-target" class="lsb" type="iscsi-target"/>
      </group>
      <group id="puppet-cluster">
        <primitive class="ocf" provider="heartbeat" type="IPaddr2" id="puppet-master-ip">
          <instance_attributes id="ia-puppet-master-ip">
            <attributes>
              <nvpair id="puppet-master-ip-1" name="ip" value="10.0.3.6"/>
              <nvpair id="puppet-master-ip-2" name="nic" value="eth0"/>
            </attributes>
          </instance_attributes>
          <operations>
            <op id="puppet-master-ip-monitor-0" name="monitor" interval="60s" timeout="10s"/>
          </operations>
        </primitive>
        <primitive class="lsb" id="puppet-master" type="puppetmaster"/>
      </group>
    </resources>
    <constraints>
      <rsc_location id="san-placement-1" rsc="ms-san">
        <rule id="san-rule-1" score="INFINITY" boolean_op="or">
          <expression id="exp-01" value="en1-r1-san1" attribute="#uname" operation="eq"/>
          <expression id="exp-02" value="en1-r1-san2" attribute="#uname" operation="eq"/>
        </rule>
      </rsc_location>
      <rsc_location id="iscsi-placement-1" rsc="iscsi-cluster">
        <rule id="iscsi-rule-1" score="INFINITY" boolean_op="or">
          <expression id="exp-03" value="en1-r1-san1" attribute="#uname" operation="eq"/>
          <expression id="exp-04" value="en1-r1-san2" attribute="#uname" operation="eq"/>
        </rule>
      </rsc_location>
      <rsc_location id="puppet-placement-1" rsc="puppet-cluster">
        <rule id="puppet-rule-1" score="INFINITY" boolean_op="or">
          <expression id="exp-05" value="en1-r1-san1" attribute="#uname" operation="eq"/>
          <expression id="exp-06" value="en1-r1-san2" attribute="#uname" operation="eq"/>
        </rule>
      </rsc_location>
      <rsc_order id="iscsi_promotes_ms-san" from="iscsi-cluster" action="start" to="ms-san" to_action="promote" type="after"/>
      <rsc_colocation id="iscsi_on_san" to="ms-san" to_role="Master" from="iscsi-cluster" score="INFINITY"/>
    </constraints>
  </configuration>
</cib>
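(As an aside: in an opt-in cluster the location rules above only say where the resources *may* run; an explicit opt-out rule for the physical nodes would make the placement intent unambiguous, although as far as I know it doesn't stop the initial probes from running everywhere. A sketch in the same style as the constraints above; the ids san-placement-2, san-rule-2, exp-07 and exp-08 are made up, the node names are the ones from this setup:)

```xml
<rsc_location id="san-placement-2" rsc="ms-san">
  <rule id="san-rule-2" score="-INFINITY" boolean_op="or">
    <expression id="exp-07" value="en1-r1-srv1" attribute="#uname" operation="eq"/>
    <expression id="exp-08" value="en1-r1-srv2" attribute="#uname" operation="eq"/>
  </rule>
</rsc_location>
```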

Oh yeah, the nodes are en1-r1-san1 and en1-r1-san2 (virtual servers), and en1-r1-srv1 and en1-r1-srv2 (physical servers).

A couple of problems arise when we start the cluster:
- CRM tries to run /etc/init.d/puppetmaster status and /etc/init.d/iscsi-target status on srv1 and srv2, which fails because those daemons aren't installed there. Because it's unsure whether the daemons are running, it doesn't start them on san1 or san2.
- CRM looks for the drbdadm tool (probably as defined in the OCF resource agent for drbd) on srv1 and srv2 with 'which'; this fails, and they get started on san1 and san2. The logs show me this:

Sep  9 15:39:48 en1-r1-srv1 crmd: [8012]: info: do_lrm_rsc_op: Performing op=drbd-san:1_monitor_0 key=5:5:1e46411a-cc95-4104-abfa-9faf13eab862)
Sep  9 15:39:48 en1-r1-srv1 lrmd: [8009]: info: rsc:drbd-san:1: monitor
Sep  9 15:39:48 en1-r1-srv1 lrmd: [8009]: info: RA output: (drbd-san:1:monitor:stderr) which: no drbdadm in (/usr/ ... )
Sep  9 15:39:48 en1-r1-srv1 drbd[8088]: [8099]: ERROR: Setup problem: Couldn't find utility drbdadm
Sep  9 15:39:48 en1-r1-srv1 crmd: [8012]: ERROR: process_lrm_event: LRM operation drbd-san:1_monitor_0 (call=7, rc=5) Error not installed
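For the first problem, a do-nothing LSB-style init script is enough: the probes (the *_monitor_0 operations) run on every online node even in an asymmetric cluster, so the script just has to exist and return the right LSB exit codes, in particular exit code 3 for "status" when the daemon isn't running. A sketch; the /tmp path is only for illustration (on a real node it would be the script at /etc/init.d/iscsi-target or /etc/init.d/puppetmaster on srv1/srv2):

```shell
#!/bin/sh
# Write a hypothetical stand-in init script for nodes that will never run
# the daemon. The key point: "status" must exit 3 (LSB for "program is not
# running"), which the CRM probe treats as a clean "stopped" result rather
# than a failure.
cat > /tmp/iscsi-target-stub <<'EOF'
#!/bin/sh
case "$1" in
  status)     exit 3 ;;  # LSB: not running
  start|stop) exit 0 ;;  # nothing to manage on this node
  *)          echo "Usage: $0 {start|stop|status}" >&2; exit 2 ;;
esac
EOF
chmod +x /tmp/iscsi-target-stub

# A probe-style status call now returns the "not running" code cleanly:
/tmp/iscsi-target-stub status
echo "status rc=$?"
```

Running this prints `status rc=3`, which is exactly what the LRM expects from a stopped LSB resource.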


When I try to stop heartbeat on a node, it drops into a deadlock because CRM tries to stop drbd-san:0 and drbd-san:1 on that node (through the LRM, I think). On a stop I get this in the logs, repeating every minute:

Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: info: do_shutdown: All subsystems stopped, continuing
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: ERROR: verify_stopped: Resource drbd-san:1 was active at shutdown. You may ignore this error if it is unmanaged.
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: ERROR: verify_stopped: Resource drbd-san:0 was active at shutdown. You may ignore this error if it is unmanaged.

The very first problem I've solved with a dummy script and some symlinks; now the whole cluster does start properly, except for some 'can't find drbdadm' errors, but I can't stop it properly. I can call stop on it, wait for the 'can't stop drbd-san:*' errors, then clean those resources with crm_resource -C, and then heartbeat will go down.
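For reference, the manual cleanup step looks roughly like this (resource and node names taken from the logs above; this is a sketch of cluster commands, it obviously needs a node that can still talk to the cluster):

```shell
# Clear the failed probe/stop results for the drbd clone instances on srv1,
# so the CRM stops trying to stop resources that never ran there.
crm_resource -C -r drbd-san:0 -H en1-r1-srv1
crm_resource -C -r drbd-san:1 -H en1-r1-srv1
```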

ciblint is giving me some interesting errors too, which seem related:

Anybody have a clue what I'm doing wrong? I'm at a loss here.
I've considered moving from heartbeat to OpenAIS, but that can't really be the problem now, can it? I'm running it all on CentOS 5; pacemaker is packaged with heartbeat 2.1.3 on it.

Any help, pointers or suggestions would be very much appreciated!
Cheers
Arthur Holstvoogd


_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
