Hi,

On Tue, Nov 04, 2008 at 02:20:42PM -0500, Aaron Bush wrote:
> Thanks for taking a look into this more.
>
> I have pulled down the 'tip' version of Linux-HA and copied over the new
> ./lib/plugins/stonith/external/riloe into the system install path (did a
> diff and there are significant changes).
> Rebooted both nodes in this cluster.
> Started same test again... Node 1 loses primary network connection to
> LAN, thereby not able to get status or connect to the Stonith device
> (ILO) for Node 2.
>
> The monitor process for the riloe appears to timeout and it is still
> downhill from there (here are log entries from Node1 who lost the
> network connection):
>
>
> Nov 4 13:25:28 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Down
> Nov 4 13:25:58 wwwlb01 lrmd: [8224]: WARN: cl_stonith_lb02:0:monitor
> process (PID 9213) timed out (try 1). Killing with signal SIGTERM (15).
> Nov 4 13:25:58 wwwlb01 lrmd: [9213]: ERROR: stonithd_receive_ops_result
> failed.

This has been fixed: fix included in pacemaker 1.0. Though it
makes no difference here.

> Nov 4 13:25:58 wwwlb01 lrmd: [8224]: WARN: mapped the invalid return
> code 254.
> Nov 4 13:25:58 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=1) complete
> ...
> Nov 4 13:25:59 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=5:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
> Nov 4 13:25:59 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
> ...
> Nov 4 13:25:59 wwwlb01 lrmd: [9898]: info: Try to stop STONITH resource
> <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> ...
> Nov 4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=-2) Cancelled
> Nov 4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=12, rc=0) complete
> Nov 4 13:26:01 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_start_0
> key=19:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
> Nov 4 13:26:01 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: start
> Nov 4 13:26:01 wwwlb01 lrmd: [9902]: info: Try to start STONITH
> resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Nov 4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
> ilo_can_reset from StonithNVpair
> Nov 4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
> ilo_protocol from StonithNVpair
> Nov 4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter
> ilo_powerdown_method from StonithNVpair
> ...
> Nov 4 13:26:13 wwwlb01 stonithd: [9904]: info: external_run_cmd:
> Calling '/usr/lib64/stonith/plugins/external/riloe status' returned 256
> Nov 4 13:26:13 wwwlb01 stonithd: [8225]: WARN: start cl_stonith_lb02:0
> failed, because its hostlist is empty
> Nov 4 13:26:13 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_start_0 (call=13, rc=1) complete
> Nov 4 13:26:14 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing
> op=cl_stonith_lb02:0_stop_0
> key=4:4:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
> Nov 4 13:26:14 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
> Nov 4 13:26:14 wwwlb01 lrmd: [9917]: info: Try to stop STONITH resource
> <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Nov 4 13:26:14 wwwlb01 stonithd: [8225]: notice: try to stop a resource
> cl_stonith_lb02:0 who is not in started resource queue.
> Nov 4 13:26:14 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM
> operation cl_stonith_lb02:0_stop_0 (call=14, rc=0) complete
> Nov 4 13:26:19 wwwlb01 cib: [8223]: info: cib_stats: Processed 44
> operations (3409.00us average, 0% utilization) in the last 10min
> Nov 4 13:27:34 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Up, 100
> Mbps full duplex
> Nov 4 13:27:35 wwwlb01 heartbeat: [5969]: info: Link
> wwwlb02.microcenter.com:eth0 up.
>
> In playing with the riloe python script I assume that the call to
> HTTPSConnection is hanging and then being later killed by lrmd.

BTW, did you try to test your ilo device with the stonith
program? Use -d to get debugging output.

> It
> looks like Python 2.6 added a timeout argument to the HTTPSConnection
> call. The system is running 2.4.3 so I couldn't test it. I do see that
> the socket timeout can be set like this:
> socket.setdefaulttimeout(1)
> I will follow this up by saying that my Python skills are very rusty.

I'd prefer to have the upper layer (stonithd) timeout. Why do you
think that this would help?
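
For reference, the client-side timeout discussed above could look roughly
like the sketch below. It is only an illustration and assumes Python 3
(the 2.4.3 on that system has no timeout argument); probe_https and its
return values are hypothetical names and are not part of external/riloe.

```python
import http.client
import socket


def probe_https(host, port=443, timeout=5.0):
    """Return "ok", "timeout", or "error" for one HTTPS request.

    Hypothetical helper for illustration only; it is not taken from
    the external/riloe plugin.
    """
    # timeout= bounds each blocking socket operation (connect, TLS
    # handshake, read) on this connection only. Calling
    # socket.setdefaulttimeout(1) instead would change the default
    # for every new socket in the process.
    conn = http.client.HTTPSConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse().read()
        return "ok"
    except socket.timeout:
        # The peer accepted the TCP connection but never answered in time.
        return "timeout"
    except (OSError, http.client.HTTPException):
        # Refused connection, TLS or certificate failure, malformed reply.
        return "error"
    finally:
        conn.close()
```

With something like this the script fails on its own after a bounded wait
instead of hanging until lrmd kills it. Whether that is preferable to
letting stonithd time out is exactly the open question here.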

> I am trying to find out what the expected behavior should be for a
> timeout on a start or monitor command.

A timeout on start is actually a timeout on monitor. Every
stonith start includes a monitor operation. Otherwise, start
should've been named "enable" for stonith resources.

> Should Stonith agents follow the
> OCF resource agent specs?

OCF class != stonith class.

If your stonith device is ok and you can use it with the stonith
program successfully, then please file a bugzilla and attach a
hb_report generated report.

Thanks,

Dejan

> Thanks,
> -ab
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Dejan
> Muhamedagic
> Sent: Tuesday, November 04, 2008 11:26 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Question on ILO stonith resource config and
> restarting
>
> On Thu, Oct 30, 2008 at 03:07:24PM -0400, Aaron Bush wrote:
> > Just realized that I only included the log entries from the node that
> > was not experiencing a network disconnect. Attached are the log entries
> > from the node (01) that had the stonith resource running before the
> > cable disconnect and looks like they provide some more useful
> > information. Also included up through when the network cable was
> > reconnected.
>
> The monitor operation on riloe failed. You should definitely
> upgrade.
>
> Thanks,
>
> Dejan
>
> >
> > -ab
> >
> > >> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
> > >> as follows:
> > >>
> > >> Group-lvs(ordered): two primitives -> ocf/IPddr2 and ocf/ldirectord.
> > >> Clone-pingd: set to monitor a couple of Ips and used to set a weight for
> > >> where to run the LVS group.
> > >>
> > >> -- This is the area that I have a question on --
> > >> Clone-stonith-node1: HP ILO to shoot node1
> > >> Clone-stonith-node2: HP ILO to shoot node2
> > >>
> > >> I read on the old linux-ha site that using a clone for ILO/stonith was
> > >> the way to go.
> > >> I'm not sure I see how this would work correctly and be
> > >> preferred over a standard resource. What I am confused about is this:
> > >> the external/riloe stonith plugin only knows how to shoot one node so
> > >
> > >Please make sure that you use the latest edition of
> > >external/riloe. The previous one didn't work under all
> > >circumstances.
> >
> > I am using the version that came with heartbeat-common-2.99.0-3.1
> > (according rpm -qf)
> >
> > To clear my current issue where the stonith resource was not started
> > (and since this is still in the lab) I have rebooted both nodes to start
> > with a somewhat clean slate. I have attempted to grab some more useful
> > information from the logs on why the resource is not restarting from.
> > Again I disconnect the LAN cable connecting a node to the rest of the
> > network (a private HB channel is still available and the ILO is still
> > up). I noticed these entries in the log:
> >
> > Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> > op=cl_stonith_lb02:0_start_0
> > key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> > Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
> > Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH
> > resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> > Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> > ilo_can_reset from StonithNVpair
> > Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> > ilo_protocol from StonithNVpair
> > Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter
> > ilo_powerdown_method from StonithNVpair
> > Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link
> > wwwlb01.microcenter.com:eth0 dead.
> > Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback:
> > Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> > Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback:
> > Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> > Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for
> > cl_stonith_lb02:0 is empty, please fix your constraints
> > Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0
> > failed, because its hostlist is empty
> > Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> > operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> > Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
> > Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource
> > cl_stonith_lb02:0 who is not in started resource queue.
> > Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing
> > op=cl_stonith_lb02:0_stop_0
> > key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> > Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH
> > resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> > Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM
> > operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
> >
> >
> >
> > Looks like I should specify from additional nvpair's for the ilo's. The
> > WARN host list empty message is what looks bad to me. Here is the cib
> > section for the clone resource and the cib constraint for this resource.
> > Please let me know if there is some obvious errors in this
> > configuration. This is the stonith resource that is to shoot the 02
> > node, intended to run on the 01 node (the 01 node was the node who had a
> > network cable disconnect).
> >
> >
> > <clone id="cl_stonithset_lb02">
> >   <meta_attributes id="cl_stonithset_lb02_meta_attrs">
> >     <attributes>
> >       <nvpair id="cl_stonithset_lb02_metaattr_target_role"
> >         name="target_role" value="started"/>
> >       <nvpair id="cl_stonithset_lb02_metaattr_clone_max"
> >         name="clone_max" value="1"/>
> >       <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max"
> >         name="clone_node_max" value="1"/>
> >     </attributes>
> >   </meta_attributes>
> >   <primitive id="cl_stonith_lb02" class="stonith"
> >     type="external/riloe" provider="heartbeat">
> >     <instance_attributes id="cl_stonith_lb02_instance_attrs">
> >       <attributes>
> >         <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224"
> >           name="hostlist" value="wwwlb02.microcenter.com"/>
> >         <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80"
> >           name="ilo_hostname" value="10.100.254.162"/>
> >         <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33"
> >           name="ilo_user" value="Administrator"/>
> >         <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a"
> >           name="ilo_password" value="PASSWORD"/>
> >       </attributes>
> >     </instance_attributes>
> >     <operations>
> >       <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb"
> >         name="monitor" interval="30" timeout="20" start_delay="15"
> >         disabled="false" role="Started" on_fail="restart"/>
> >       <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start"
> >         timeout="20" start_delay="0" disabled="false" role="Started"
> >         on_fail="restart"/>
> >     </operations>
> >     <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
> >       <attributes>
> >         <nvpair id="cl_stonith_lb02:0_metaattr_target_role"
> >           name="target_role" value="started"/>
> >       </attributes>
> >     </meta_attributes>
> >   </primitive>
> > </clone>
> >
> > <constraints>
> >   <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
> >     <rule id="prefered_location_on_lb01" score="INFINITY">
> >       <expression attribute="#uname"
> >         id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq"
> >         value="wwwlb01.microcenter.com"/>
> >     </rule>
> >   </rsc_location>
> > </constraints>
> >
> > Thanks,
> > -ab
> >
> > _______________________________________________
> > Pacemaker mailing list
> > [email protected]
> > http://list.clusterlabs.org/mailman/listinfo/pacemaker
