> Message: 1
> Date: Sat, 12 Jul 2014 09:42:57 -0400
> From: Ken Gaillot <[email protected]>
> To: [email protected]
> Subject: [Pacemaker] Occasional nonsensical resource agent errors
>     since Debian 3.2.57-3+deb7u1 kernel update
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
>
> We run multiple deployments of corosync+pacemaker on Debian "wheezy" for
> high-availability of various resources. The configurations are unchanged
> and ran without any issues for many months. However, since we applied
> the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting
> resource agent errors on rare occasions, with error messages that are
> clearly incorrect.
>
> [....]
>
> Given the odd error messages from the resource agent, I suspect it's a
> memory corruption error of some sort. We've been unable to find anything
> else useful in the logs, and we'll probably end up reverting to the
> prior kernel version. But given the rarity of the issue, it would be a
> long while before we could be confident that fixed it.
>
> Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel
> or later? Has anyone had any similar issues?

Just curious: I see you're running Xen; are you setting dom0_mem? I had similar issues with SLES 11 SP2 and SP3 (but not <= SP1), which turned out to be apparently random memory corruption due to a kernel bug. It was mostly random, but I did eventually find a repeatable test case: checksum verification of a kernel build tree with mtree. On affected systems there would usually be a few files that failed to verify.

I had been setting dom0_mem=768M, as that was a good balance between maximizing the memory available for VMs and keeping enough for services in Dom0 (including pacemaker/corosync). I set the node attributes for pacemaker utilization to 1GB less than physical RAM, leaving 256M available for Xen overhead, etc.

Raising dom0_mem to 2048M (or not setting it at all) was a sufficient workaround to avoid the bug, but I have finally received a fixed kernel from Novell support.

Note: this fix has not yet made it into any official updates for SLES 11. Novell/SUSE say it will be in the next kernel version, whenever that happens. Recent openSUSE kernels are also affected (and have yet to be fixed).

-Andrew

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
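
For anyone who wants to try the repeatable test case from the reply above without mtree, here is a rough Python stand-in. It is only a sketch of the same idea (record a checksum for every file in a large tree, then re-verify), not the exact mtree invocation that was used. The function names are made up for illustration, and SHA-256 is an arbitrary choice of digest. Ideally the manifest would be built on a machine that is known to be good, since a manifest built on an affected machine could itself contain bad digests.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map the relative path of every regular file under root to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def verify(root, manifest):
    """Return the sorted relative paths whose digest no longer matches."""
    failed = []
    for rel, expected in manifest.items():
        try:
            actual = file_digest(os.path.join(root, rel))
        except OSError:
            actual = None  # a missing or unreadable file counts as a failure
        if actual != expected:
            failed.append(rel)
    return sorted(failed)


def run_passes(root, passes=5):
    """Build one manifest, then verify it several times and print mismatches."""
    manifest = build_manifest(root)
    for i in range(1, passes + 1):
        failed = verify(root, manifest)
        print("pass %d: %d of %d files failed" % (i, len(failed), len(manifest)))
        for rel in failed:
            print("  " + rel)
```

Calling `run_passes("/usr/src/linux")` (the path is just an example) on an unmodified tree should report zero failures on a healthy system. Any mismatch on files that nobody has touched would point toward the kind of corruption described above.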
