Hi Aaron.

First of all, what I say in this message is based on my experience
with Heartbeat 2.1.3 and external/ibmrsa-telnet several months ago,
so be careful when applying it to your problem.

> What I am confused about is this:
> the external/riloe stonith plugin only knows how to shoot one node so
> why would you want to run it as a clone since each external/riloe is
> configured differently.
...
> I then noticed that my ILO clones were starting on the 'wrong' nodes.
> As in the stonith resource to kill node 2 was actually running on node
> 2; which is pointless if node 2 locks up.  So I added resource
> constraints to force the stonith clone to stay on a node that was not
> the one to be shot.  This seemed to work well.

I don't think external/riloe is meant to run as a clone resource,
any more than external/ibmrsa-telnet is. Both are meant to run in
the same way as ordinary resource agents.
See the attached sample configuration.

> The next issue I have is that when I disconnect the LAN cable on a
> single node that connects it to the rest of the network the clone
> stonith monitor will fail since it can't connect to the other nodes ILO
> for status.  After some time (minutes let's say) I reconnect the LAN
> cable but never see the clone stonith come back to life, just stays
> failed.  What should I be looking at to make sure that the clone stonith
> restarts properly.

I presume you want to know how to recover from a monitor failure of
a stonith plugin. Is my guess right? If so, run the following
commands.

# crm_failcount -D -r prmStonithN2 -U node01
# crm_resource -C -r prmStonithN2 -H node02

!!! Caution !!!
Some options may have changed in more recent Pacemaker releases.
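
For what it's worth, in later Pacemaker releases I believe both steps
(clearing the failcount and cleaning up the failed operation) are
usually done with a single cleanup call. The exact option spellings
below are from my memory of newer crm_resource versions, so verify
them against crm_resource --help on your release:

# crm_resource --cleanup -r prmStonithN2 --node node02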

Aaron Bush wrote:
I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
as follows:

Group-lvs (ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
Clone-pingd: set to monitor a couple of IPs and used to set a weight for
where to run the LVS group.

-- This is the area that I have a question on --
Clone-stonith-node1: HP ILO to shoot node1
Clone-stonith-node2: HP ILO to shoot node2

I read on the old linux-ha site that using a clone for ILO/stonith was
the way to go.  I'm not sure I see how this would work correctly and be
preferred over a standard resource.  What I am confused about is this:
the external/riloe stonith plugin only knows how to shoot one node so
why would you want to run it as a clone since each external/riloe is
configured differently.  I went ahead and configured the riloe's as
clones feeling that the docs are correct and that the reason would
become obvious to me later.  (I also saw a similar post with no
response:
http://www.gossamer-threads.com/lists/linuxha/users/35685?nohighlight=1#35685)

I then noticed that my ILO clones were starting on the 'wrong' nodes.
As in the stonith resource to kill node 2 was actually running on node
2; which is pointless if node 2 locks up.  So I added resource
constraints to force the stonith clone to stay on a node that was not
the one to be shot.  This seemed to work well.

The next issue I have is that when I disconnect the LAN cable on a
single node that connects it to the rest of the network the clone
stonith monitor will fail since it can't connect to the other nodes ILO
for status.  After some time (minutes let's say) I reconnect the LAN
cable but never see the clone stonith come back to life, just stays
failed.  What should I be looking at to make sure that the clone stonith
restarts properly?

Any advice on how to more properly setup an HP ILO stonith in this
scenario would be greatly appreciated.  (I can see where a clone stonith
would be useful in a large cluster of n>2 nodes since all nodes could
have a chance to shoot a failed node and maybe this is the reason for
cloned stonith with ILO?  Basically in a cluster of N nodes each node
would be running N-1 stonith resources, ready to shoot a dead node.)

Thanks in advance,
-ab


_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker

--
Takenaka Kazuhiro <[EMAIL PROTECTED]>
NTT Open Source Software Center

 <cib admin_epoch="0" epoch="0" num_updates="0">
   <configuration>
     <crm_config>
... snip ...
     <resources>
... write configurations for resources that you really want to use ...

<!-- the configuration for the stonith plugin that shoots node01 from node02 -->
       <primitive id="prmStonithN1" class="stonith" type="external/ibmrsa-telnet" provider="heartbeat" resource_stickiness="INFINITY">
         <operations>
           <op name="monitor" interval="20" timeout="300" prereq="nothing" id="prmStonithN1:monitor"/>
           <op name="start" timeout="180" id="prmStonithN1:start"/>
           <op name="stop" timeout="180" id="prmStonithN1:stop"/>
         </operations>
         <instance_attributes id="prmStonithN1:attr">
           <attributes>
             <nvpair id="prmStonithN1:nodename" name="nodename" value="node01"/>
             <nvpair id="prmStonithN1:ipaddr" name="ip_address" value="192.168.16.126"/>
             <nvpair id="prmStonithN1:userid" name="username" value="USERID"/>
             <nvpair id="prmStonithN1:passwd" name="password" value="***"/>
           </attributes>
         </instance_attributes>
       </primitive>

<!-- the configuration for the stonith plugin that shoots node02 from node01 -->
       <primitive id="prmStonithN2" class="stonith" type="external/ibmrsa-telnet" provider="heartbeat" resource_stickiness="INFINITY">
         <operations>
           <op name="monitor" interval="20" timeout="300" prereq="nothing" id="prmStonithN2:monitor"/>
           <op name="start" timeout="180" id="prmStonithN2:start"/>
           <op name="stop" timeout="180" id="prmStonithN2:stop"/>
         </operations>
         <instance_attributes id="prmStonithN2:attr">
           <attributes>
             <nvpair id="prmStonithN2:nodename" name="nodename" value="node02"/>
             <nvpair id="prmStonithN2:ipaddr" name="ip_address" value="192.168.16.127"/>
             <nvpair id="prmStonithN2:userid" name="username" value="USERID"/>
             <nvpair id="prmStonithN2:passwd" name="password" value="###"/>
           </attributes>
         </instance_attributes>
       </primitive>
     </resources>

     <constraints>
... write constraints for resources that you really want to use ...

<!-- the constraints to keep the stonith plugin that shoots node01 at node02 -->
       <rsc_location id="prmStonithN1_hates_node01" rsc="prmStonithN1">
         <rule id="prmStonithN1_hates_node01_rule" score="-INFINITY">
           <expression attribute="#uname" operation="eq" value="node01" id="prmStonithN1_hates_N1_expr"/>
         </rule>
       </rsc_location>

<!-- the constraints to keep the stonith plugin that shoots node02 at node01 -->
       <rsc_location id="prmStonithN2_hates_node02" rsc="prmStonithN2">
         <rule id="prmStonithN2_hates_node02_rule" score="-INFINITY">
           <expression attribute="#uname" operation="eq" value="node02" id="prmStonithN2_hates_N2_expr"/>
         </rule>
       </rsc_location>

     </constraints>
   </configuration>
   <status/>
 </cib>
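
In case it is useful: on a 0.6-era cluster I believe resource and
constraint fragments like the ones above can be loaded section by
section with cibadmin. The file names here are made up, and option
spellings may differ between versions, so check the cibadmin man
page first:

# cibadmin -U -o resources -x stonith-primitives.xml
# cibadmin -U -o constraints -x stonith-constraints.xml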
