On 21 Mar 2014, at 12:40 am, Michał Margula <[email protected]> wrote:
> Hello,
>
> We had many unresolved issues with Pacemaker some time ago. I think
> almost all of them were solved by fixing the link between the clusters
> (we removed the media converters, replaced them with SFP+ NICs, and
> upgraded to 10Gbps).
>
> Now it seems to be working fine, with a few exceptions:
>
> - if I kill one node manually (power off, but IPMI is still operational,
>   so stonith works fine)
>
> or
>
> - if I move one of the nodes to standby while it has a few Xen domUs
>
> it gets UNCLEAN. The funny thing is that if I kill (or put into standby)
> node B, node A also becomes unclean, so I end up with crm_mon showing
> Node-A: UNCLEAN (online), Node-B: UNCLEAN (offline). To be honest, I have
> a lot of trouble diagnosing it (BTW: is there some kind of documentation
> on how to read Pacemaker's logs?).
>
> One thing I found that worries me is:
>
> Mar 20 04:16:39 rivendell-A kernel: [  774.635312] stonithd[10089]:
> segfault at 0 ip 00007f51a1aa5bd4 sp 00007fff20c7fb50 error 4 in
> libcrmcommon.so.2.0.0[7f51a1a93000+2d000]
>
> It happens on both nodes, and it seems to happen only when I define a
> manual fencing device (meatware) like this:
>
> primitive manual-fencing-of-A stonith:meatware \
>         params hostlist="rivendell-B" \
>         op monitor interval="60s" \
>         meta target-role="Started"
> primitive manual-fencing-of-B stonith:meatware \
>         params hostlist="rivendell-A" \
>         op monitor interval="60s" \
>         meta target-role="Started"
> location location-manual-fencing-of-A manual-fencing-of-A -inf: rivendell-A
> location location-manual-fencing-of-B manual-fencing-of-B -inf: rivendell-B
>
> Here is the configuration we currently use (without manual fencing):
> http://pastebin.com/CudX6wx3
>
> BTW, is there a way to recover from such a situation? I can only fix it
> by restarting corosync or rebooting a node, but that then kills the
> other node because of the UNCLEAN state.
>
> Also, if it is a Pacemaker bug, how do I debug and fix it?
> We are currently using Debian Wheezy with Pacemaker
> 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff.
>
> I see there are more up-to-date versions, but not for Debian. Should I
> consider upgrading?
>
> Thank you!
>
> --
> Michał Margula, [email protected], http://alchemyx.uznam.net.pl/
> "W życiu piękne są tylko chwile" ("In life, only the moments are
> beautiful") [Ryszard Riedel]

The first step is looking at the logs for any errors. The second is
producing a stack trace from the crash. The third is reading
http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/
and getting a newer version.
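To expand on the second step, here is a minimal sketch of capturing a backtrace from the stonithd segfault. It assumes gdb and Pacemaker's debug symbols are installed (on Debian, the pacemaker-dbg package) and that the daemon binary lives at /usr/lib/pacemaker/stonithd; verify that path on your system, as packaging layouts differ.

```shell
# Make sure core dumps are not truncated before reproducing the crash.
ulimit -c unlimited
ulimit -c    # should now print: unlimited

# Optionally send cores to a predictable location (as root); the %e/%p
# specifiers expand to the executable name and the PID.
#   echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# After the next segfault, point gdb at the daemon binary and the core
# file to extract a full backtrace of all threads:
#   gdb --batch -ex 'thread apply all bt full' \
#       /usr/lib/pacemaker/stonithd /var/tmp/core.stonithd.<pid>
```

With a backtrace in hand it is much easier to match the crash against known reports at http://bugs.clusterlabs.org before filing a new one.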
_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
