Hi,
I'm setting up a cluster with crm over heartbeat and I keep running into trouble with resources being called on nodes that don't have them. The setup is pretty simple: we have 4 nodes, two physical servers and two virtual servers (Xen), in an asymmetric cluster. The Xen servers have to run DRBD (primary/secondary), an iSCSI target and a third daemon. (The physical servers don't run anything yet, but will have to mount stuff and start more Xen guests later on. That's why they are in the cluster.)
This is the CIB XML, pretty self-explanatory I guess:

<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <resources>
      <master_slave id="ms-san">
        <meta_attributes id="ma-ms-san">
          <attributes>
            <nvpair id="ma-ms-san-1" name="clone_max" value="2"/>
            <nvpair id="ma-ms-san-2" name="clone_node_max" value="1"/>
            <nvpair id="ma-ms-san-3" name="master_max" value="1"/>
            <nvpair id="ma-ms-san-4" name="master_node_max" value="1"/>
            <nvpair id="ma-ms-san-5" name="notify" value="yes"/>
            <nvpair id="ma-ms-san-6" name="globally_unique" value="false"/>
          </attributes>
        </meta_attributes>
        <primitive id="drbd-san" class="ocf" provider="heartbeat" type="drbd">
          <instance_attributes id="9002a0e4-28d2-4ca7-83d8-74cd7ac066e8">
            <attributes>
              <nvpair name="drbd_resource" value="san" id="12d7d833-facc-4ac3-b296-e5cc59dcb4d4"/>
            </attributes>
          </instance_attributes>
          <operations>
            <op name="monitor" interval="29s" timeout="10s" role="Master" id="714ea049-f14d-4b09-b856-8b374252e1de"/>
            <op name="monitor" interval="30s" timeout="10s" role="Slave" id="6c7ce46c-7fe5-4d22-8a31-eae6b2927711"/>
          </operations>
        </primitive>
      </master_slave>
      <group id="iscsi-cluster">
        <primitive class="ocf" provider="heartbeat" type="IPaddr2" id="iscsi-target-ip">
          <instance_attributes id="ia-iscsi-target-ip">
            <attributes>
              <nvpair id="ia-iscsi-target-ip-1" name="ip" value="10.0.3.5"/>
              <nvpair id="ia-iscsi-target-ip-2" name="nic" value="eth0"/>
            </attributes>
          </instance_attributes>
          <operations>
            <op id="iscsi-target-ip-monitor-0" name="monitor" interval="20s" timeout="10s"/>
          </operations>
        </primitive>
        <primitive id="iscsi-target" class="lsb" type="iscsi-target"/>
      </group>
      <group id="puppet-cluster">
        <primitive class="ocf" provider="heartbeat" type="IPaddr2" id="puppet-master-ip">
          <instance_attributes id="ia-puppet-master-ip">
            <attributes>
              <nvpair id="puppet-master-ip-1" name="ip" value="10.0.3.6"/>
              <nvpair id="puppet-master-ip-2" name="nic" value="eth0"/>
            </attributes>
          </instance_attributes>
          <operations>
            <op id="puppet-master-ip-monitor-0" name="monitor" interval="60s" timeout="10s"/>
          </operations>
        </primitive>
        <primitive class="lsb" id="puppet-master" type="puppetmaster"/>
      </group>
    </resources>
    <constraints>
      <rsc_location id="san-placement-1" rsc="ms-san">
        <rule id="san-rule-1" score="INFINITY" boolean_op="or">
          <expression id="exp-01" value="en1-r1-san1" attribute="#uname" operation="eq"/>
          <expression id="exp-02" value="en1-r1-san2" attribute="#uname" operation="eq"/>
        </rule>
      </rsc_location>
      <rsc_location id="iscsi-placement-1" rsc="iscsi-cluster">
        <rule id="iscsi-rule-1" score="INFINITY" boolean_op="or">
          <expression id="exp-03" value="en1-r1-san1" attribute="#uname" operation="eq"/>
          <expression id="exp-04" value="en1-r1-san2" attribute="#uname" operation="eq"/>
        </rule>
      </rsc_location>
      <rsc_location id="puppet-placement-1" rsc="puppet-cluster">
        <rule id="puppet-rule-1" score="INFINITY" boolean_op="or">
          <expression id="exp-05" value="en1-r1-san1" attribute="#uname" operation="eq"/>
          <expression id="exp-06" value="en1-r1-san2" attribute="#uname" operation="eq"/>
        </rule>
      </rsc_location>
      <rsc_order id="iscsi_promotes_ms-san" from="iscsi-cluster" action="start" to="ms-san" to_action="promote" type="after"/>
      <rsc_colocation id="iscsi_on_san" to="ms-san" to_role="Master" from="iscsi-cluster" score="INFINITY"/>
    </constraints>
  </configuration>
</cib>
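(As an aside: in an opt-in cluster the location rules above only say where the resources *may* run; an explicit opt-out rule for the physical nodes would make the placement intent unambiguous, although as far as I know it doesn't stop the initial probes from running everywhere. A sketch in the same style as the constraints above; the ids san-placement-2, san-rule-2, exp-07 and exp-08 are made up, the node names are the ones from this setup:)

```xml
<rsc_location id="san-placement-2" rsc="ms-san">
  <rule id="san-rule-2" score="-INFINITY" boolean_op="or">
    <expression id="exp-07" value="en1-r1-srv1" attribute="#uname" operation="eq"/>
    <expression id="exp-08" value="en1-r1-srv2" attribute="#uname" operation="eq"/>
  </rule>
</rsc_location>
```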

Oh yeah, the nodes are en1-r1-san1 and en1-r1-san2 (virtual servers), and en1-r1-srv1 and en1-r1-srv2 (physical servers).

A couple of problems arise when we start the cluster:
- CRM tries to run /etc/init.d/puppetmaster status and /etc/init.d/iscsi-target status on srv1 and srv2, which fails because those daemons aren't installed there. Because it's unsure whether the daemons are running, it doesn't start them on san1 or san2.
- CRM looks for the drbdadm tool (probably as defined in the OCF resource agent for drbd) on srv1 and srv2 with 'which'; this fails, and they get started on san1 and san2. The logs show me this:

Sep  9 15:39:48 en1-r1-srv1 crmd: [8012]: info: do_lrm_rsc_op: Performing op=drbd-san:1_monitor_0 key=5:5:1e46411a-cc95-4104-abfa-9faf13eab862)
Sep  9 15:39:48 en1-r1-srv1 lrmd: [8009]: info: rsc:drbd-san:1: monitor
Sep  9 15:39:48 en1-r1-srv1 lrmd: [8009]: info: RA output: (drbd-san:1:monitor:stderr) which: no drbdadm in (/usr/ ... )
Sep  9 15:39:48 en1-r1-srv1 drbd[8088]: [8099]: ERROR: Setup problem: Couldn't find utility drbdadm
Sep  9 15:39:48 en1-r1-srv1 crmd: [8012]: ERROR: process_lrm_event: LRM operation drbd-san:1_monitor_0 (call=7, rc=5) Error not installed
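For the first problem, a do-nothing LSB-style init script is enough: the probes (the *_monitor_0 operations) run on every online node even in an asymmetric cluster, so the script just has to exist and return the right LSB exit codes, in particular exit code 3 for "status" when the daemon isn't running. A sketch; the /tmp path is only for illustration (on a real node it would be the script at /etc/init.d/iscsi-target or /etc/init.d/puppetmaster on srv1/srv2):

```shell
#!/bin/sh
# Write a hypothetical stand-in init script for nodes that will never run
# the daemon. The key point: "status" must exit 3 (LSB for "program is not
# running"), which the CRM probe treats as a clean "stopped" result rather
# than a failure.
cat > /tmp/iscsi-target-stub <<'EOF'
#!/bin/sh
case "$1" in
  status)     exit 3 ;;  # LSB: not running
  start|stop) exit 0 ;;  # nothing to manage on this node
  *)          echo "Usage: $0 {start|stop|status}" >&2; exit 2 ;;
esac
EOF
chmod +x /tmp/iscsi-target-stub

# A probe-style status call now returns the "not running" code cleanly:
/tmp/iscsi-target-stub status
echo "status rc=$?"
```

Running this prints `status rc=3`, which is exactly what the LRM expects from a stopped LSB resource.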


When I try to stop heartbeat on a node, it drops into a deadlock because CRM tries to stop drbd-san:0 and drbd-san:1 on that node (through the LRM, I think). On a stop I get this in the logs, repeating every minute:

Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: info: do_shutdown: All subsystems stopped, continuing
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: ERROR: verify_stopped: Resource drbd-san:1 was active at shutdown. You may ignore this error if it is unmanaged.
Sep  9 10:16:15 en1-r1-srv1 crmd: [7570]: ERROR: verify_stopped: Resource drbd-san:0 was active at shutdown. You may ignore this error if it is unmanaged.

The very first problem I've solved with a dummy script and some symlinks; now the whole cluster does start properly, except for some 'can't find drbdadm' errors, but I can't stop it properly. I can call stop on it, wait for the 'can't stop drbd-san:*' errors, then clean those resources with crm_resource -C, and then heartbeat will go down.
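For reference, the manual cleanup step looks roughly like this (resource and node names taken from the logs above; this is a sketch of cluster commands, it obviously needs a node that can still talk to the cluster):

```shell
# Clear the failed probe/stop results for the drbd clone instances on srv1,
# so the CRM stops trying to stop resources that never ran there.
crm_resource -C -r drbd-san:0 -H en1-r1-srv1
crm_resource -C -r drbd-san:1 -H en1-r1-srv1
```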

ciblint is giving me some interesting errors too, which seem related:

Anybody have a clue what I'm doing wrong? I'm at a loss here.
I've considered moving from heartbeat to OpenAIS, but that can't really be the problem now, can it? I'm running it all on CentOS 5; pacemaker is packaged with heartbeat 2.1.3 on it.

Any help, pointers or suggestions would be very much appreciated!
Cheers
Arthur Holstvoogd


_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
