On Thu, Oct 30, 2008 at 03:07:24PM -0400, Aaron Bush wrote:
> Just realized that I only included the log entries from the node that
> was not experiencing a network disconnect. Attached are the log entries
> from the node (01) that had the stonith resource running before the
> cable disconnect and looks like they provide some more useful
> information. Also included up through when the network cable was
> reconnected.

The monitor operation on riloe failed. You should definitely upgrade.

Thanks,

Dejan

>
> -ab
>
> >> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
> >> as follows:
> >>
> >> Group-lvs(ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
> >> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
> >> where to run the LVS group.
> >>
> >> -- This is the area that I have a question on --
> >> Clone-stonith-node1: HP ILO to shoot node1
> >> Clone-stonith-node2: HP ILO to shoot node2
> >>
> >> I read on the old linux-ha site that using a clone for ILO/stonith was
> >> the way to go. I'm not sure I see how this would work correctly and be
> >> preferred over a standard resource. What I am confused about is this:
> >> the external/riloe stonith plugin only knows how to shoot one node so
> >
> >Please make sure that you use the latest edition of
> >external/riloe. The previous one didn't work under all
> >circumstances.
>
> I am using the version that came with heartbeat-common-2.99.0-3.1
> (according to rpm -qf)
>
> To clear my current issue where the stonith resource was not started
> (and since this is still in the lab) I have rebooted both nodes to start
> with a somewhat clean slate. I have attempted to grab some more useful
> information from the logs on why the resource is not restarting.
> Again I disconnect the LAN cable connecting a node to the rest of the
> network (a private HB channel is still available and the ILO is still
> up).
> I noticed these entries in the log:
>
> Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_start_0 key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
> Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_can_reset from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_protocol from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_powerdown_method from StonithNVpair
> Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link wwwlb01.microcenter.com:eth0 dead.
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for cl_stonith_lb02:0 is empty, please fix your constraints
> Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0 failed, because its hostlist is empty
> Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
> Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource cl_stonith_lb02:0 who is not in started resource queue.
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_stop_0 key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
>
> Looks like I should specify some additional nvpairs for the ILOs. The
> WARN host list empty message is what looks bad to me. Here is the CIB
> section for the clone resource and the CIB constraint for this resource.
> Please let me know if there are any obvious errors in this
> configuration. This is the stonith resource that is to shoot the 02
> node, intended to run on the 01 node (the 01 node was the node that had
> the network cable disconnected).
>
> <clone id="cl_stonithset_lb02">
>   <meta_attributes id="cl_stonithset_lb02_meta_attrs">
>     <attributes>
>       <nvpair id="cl_stonithset_lb02_metaattr_target_role" name="target_role" value="started"/>
>       <nvpair id="cl_stonithset_lb02_metaattr_clone_max" name="clone_max" value="1"/>
>       <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max" name="clone_node_max" value="1"/>
>     </attributes>
>   </meta_attributes>
>   <primitive id="cl_stonith_lb02" class="stonith" type="external/riloe" provider="heartbeat">
>     <instance_attributes id="cl_stonith_lb02_instance_attrs">
>       <attributes>
>         <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224" name="hostlist" value="wwwlb02.microcenter.com"/>
>         <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80" name="ilo_hostname" value="10.100.254.162"/>
>         <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33" name="ilo_user" value="Administrator"/>
>         <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a" name="ilo_password" value="PASSWORD"/>
>       </attributes>
>     </instance_attributes>
>     <operations>
>       <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb" name="monitor" interval="30" timeout="20" start_delay="15" disabled="false" role="Started" on_fail="restart"/>
>       <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start" timeout="20" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>     </operations>
>     <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
>       <attributes>
>         <nvpair id="cl_stonith_lb02:0_metaattr_target_role" name="target_role" value="started"/>
>       </attributes>
>     </meta_attributes>
>   </primitive>
> </clone>
>
> <constraints>
>   <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
>     <rule id="prefered_location_on_lb01" score="INFINITY">
>       <expression attribute="#uname" id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq" value="wwwlb01.microcenter.com"/>
>     </rule>
>   </rsc_location>
> </constraints>
>
> Thanks,
> -ab

_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
