On 03/07/2016 02:03 PM, Ferenc Wágner wrote:
> Ken Gaillot <[email protected]> writes:
>
>> On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
>>
>>> 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by
>>> vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event
>>> (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98,
>>> source=process_graph_event:600, 0)
>>
>> That means the action was initiated by a different node (the previous
>> DC presumably), so the new DC wants to recalculate everything.
>
> Time travel was sort of possible in that situation, and recurring
> monitor operations are not logged, so this is indeed possible. The
> main thing is that it wasn't mishandled.
>
>>> recovery actions turned into start actions for the resources stopped
>>> during the previous transition. However, almost all other recovery
>>> actions just disappeared without any comment. This was actually
>>> correct, but I really wonder why the cluster decided to paper over
>>> the previous monitor operation timeouts. Maybe the operations
>>> finished meanwhile and got accounted for somehow, just not logged?
>>
>> I'm not sure why the PE decided recovery was not necessary. Operation
>> results wouldn't be accepted without being logged.
>
> At which logging level? I can't see recurring monitor operation logs
> in syslog (at default logging level: notice) nor in
> /var/log/pacemaker.log (which contains info level messages as well).
>
> However, the info level logs contain more "Transition aborted" lines,
> as if only the first of them got logged with notice level. This would
> make sense, since the later ones don't make any difference on an
> already aborted transition, so they aren't that important. And in
> fact such lines were suppressed from the syslog I checked first, for
> example:
>
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:  Diff: --- 0.613.120 2
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:  Diff: +++ 0.613.121 (null)
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:  + /cib:  @num_updates=121
> 12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:  ++ /cib/status/node_state[@id='167773707']/lrm[@id='167773707']/lrm_resources/lrm_resource[@id='vm-elm']:  <lrm_rsc_op id="vm-elm_monitor_60000" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="0" op-st
> 12:55:39 [8479] vhbl07        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=vhbl05/crmd/362, version=0.613.121)
> 12:55:39 [8484] vhbl07       crmd:     info: abort_transition_graph:  Transition aborted by vm-elm_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.121, source=process_graph_event:600, 0)
> 12:55:39 [8484] vhbl07       crmd:     info: process_graph_event:     Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a different node
>
> I can very much imagine this cancelling the FAILED state induced by a
> monitor timeout like:
>
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:  ++            <lrm_resource id="vm-elm" type="TransientDomain" class="ocf" provider="niif">
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:  ++              <lrm_rsc_op id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="645" rc-code="1" op-status="2" interval="60000" last-rc-change="1456833279" exe
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:  ++              <lrm_rsc_op id="vm-elm_last_0" operation_key="vm-elm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" transition-magic="0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7" on_node="vhbl05" call-id="602" rc-code="0" op-status="0" interval="0" last-run="1456091121" last-rc-change="1456091121" e
> 12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:  ++            </lrm_resource>
>
> The transition-keys match, does this mean that the above is a late
> result from the monitor operation which was considered timed-out
> previously? How did it reach vhbl07, if the DC at that time was
> vhbl03?
>
>> The pe-input files from the transitions around here should help.
>
> They are available. What shall I look for?
It's not the most user-friendly of tools, but crm_simulate can show how
the cluster would react to each transition:

   crm_simulate -Sx $FILE.bz2

Adding -D $FILE.dot will also save a dot file; then

   dot $FILE.dot -Tpng > $FILE.png

will produce a graphic of the transition, which can be interpreted the
same way as the example at
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-config-testing-changes

>>> Basically, the cluster responded beyond my expectations, sparing
>>> lots of unnecessary recoveries or fencing. I'm happy, thanks for
>>> this wonderful software! But I'm left with these "Processing failed
>>> op monitor" warnings emitted every 15 minutes (timer pops). Is it
>>> safe and clever to clean up the affected resources? Would that get
>>> rid of them without invoking other effects, like recoveries for
>>> example?
>>
>> That's normal; it's how the cluster maintains the effect of a
>> failure that has not been cleared. The logs can be confusing,
>> because it's not apparent from that message alone whether the
>> failure is new or old.
>
> Ah, do you mean that these are the same thing that appears after
> "Failed Actions:" at the end of the crm_mon output? They certainly
> match, and the logs are confusing indeed.

Exactly

>> Cleaning up the resource will end the failure condition, so what
>> happens next depends on the configuration and state of the cluster.
>> If the failure was preventing a preferred node from running the
>> resource, the resource could move, depending on other factors such
>> as stickiness.
>
> These resources are (still) running fine, suffered only monitor
> failures, and are node-neutral, so it should be safe to clean them
> up, I suppose.

Most likely
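To expand on the crm_simulate suggestion above: if you want to sweep a
whole series of transitions in one go, a loop along these lines should
work. This is only a sketch; it assumes the pe-input files are still
under Pacemaker's default /var/lib/pacemaker/pengine/ directory, so
point the glob at wherever yours actually live.

   # Simulate every saved transition and render its graph.
   # The pengine directory below is an assumption; adjust as needed.
   for f in /var/lib/pacemaker/pengine/pe-input-*.bz2; do
       base=${f%.bz2}
       crm_simulate -Sx "$f" -D "$base.dot"   # simulate, save dot file
       dot "$base.dot" -Tpng > "$base.png"    # render transition graph
   done

And for the cleanup question, the command-line form would be something
like the following; the resource and node names are just the ones from
your logs, and crm_mon should show the corresponding "Failed Actions"
entry disappear afterwards.

   # Clear the stored failure for one resource on one node ...
   crm_resource --cleanup --resource vm-elm --node vhbl05

   # ... or on every node at once:
   crm_resource --cleanup --resource vm-elm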
_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org