> Message: 6
> Date: Tue, 15 Jul 2014 15:36:40 -0400
> From: Ken Gaillot <[email protected]>
> To: The Pacemaker cluster resource manager
> <[email protected]>
> Subject: Re: [Pacemaker] Occasional nonsensical resource agent errors
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi Andrew,
>
> Thanks for the feedback!
>
> Our "aries/taurus" cluster are Xen dom0s, and we pin dom0_mem so there's
> at least 1GB RAM reported in the dom0 OS. (The version of Xen+Linux
> kernel in wheezy has an issue where the reported RAM is less than the
> dom0_mem value, so dom0_mem is actually higher.)
>
> However we are also seeing the issue on our "talos/pomona" cluster,
> which are not dom0s, so I don't suspect Xen itself. But it could be the
> same kernel issue.
>
> mtree isn't packaged for Debian, and I'm not familiar with it, although
> I did see a Linux port on Google code. How do you use it for your test
> case? What do the detected differences signify?

That mtree-port from Google code is what I used; fortunately for me it was
packaged in the OBS already: http://software.opensuse.org/package/mtree
It looks like the only build-dep it has is openssl-devel, so not too hard to
build. I'm sure there are other utilities that accomplish the same thing (e.g.
tripwire) but I was familiar with mtree from BSD-land, so it's what I used.

Backtracking a bit, when I saw these strange errors, running 'rpm -Va' (verify
installed files from all packages; there's probably a dpkg equivalent but I
don't know it off-hand) would sometimes, but not consistently, produce errors.
I decided that perhaps I needed a bigger dataset, and I had been playing with
zfsonlinux on another box, which had several kernel trees extracted for that,
so I tarballed the build dirs (2.6GB, 171k files), checksummed them with mtree,
then copied the tarball and checksum file to the boxen with problems and
verified it there. I actually had to boot into a known-good kernel (in my
case, kernel-default rather than kernel-xen) to get a clean untar.

Under the problematic kernels, a small number of files would fail to verify
(which files failed tended to change, but I would almost always get some
errors). Occasionally the filesystem would also report I/O errors (much more
likely to happen under btrfs than xfs or ext3), but after rebooting and running
fsck/xfs_repair/btrfs scrub etc. the FS would check out clean.
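
To make the "which files failed tended to change" observation concrete, here is a minimal Python sketch that compares the failing paths from several verification runs. It is an illustration only: the function name classify_failures and the input format (one list of failing paths per run) are assumptions for this example, not something the tools above produce directly.

```python
from collections import Counter


def classify_failures(runs):
    """Split failing paths into those seen in every run and those seen in only some.

    `runs` is a list of iterables, one per verification run, each holding the
    paths that failed to verify in that run (hypothetical input format).
    """
    # Count each path at most once per run.
    counts = Counter(path for run in runs for path in set(run))
    persistent = sorted(p for p, n in counts.items() if n == len(runs))
    intermittent = sorted(p for p, n in counts.items() if n < len(runs))
    return persistent, intermittent
```

A path that fails in every run suggests the copy on disk really is bad, while a set of failures that shifts from run to run is at least consistent with the data being damaged somewhere between disk and reader.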

Basic mtree usage--

Generate checksum file:
  1) cd /path/to/testroot
  2) mtree -c -K sha256digest > /path/to/checksumfile
     (keep the checksum file outside testroot)

Verify:
  1) cd /path/to/testroot
  2) mtree -f /path/to/checksumfile

Like diff, only differences (in file size/mode/checksum/etc.) are reported and
no output means everything verifies.
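
Where mtree isn't available, roughly the same generate-then-verify check can be sketched with Python's standard library. This is a rough illustration under stated assumptions, not a replacement for mtree: it only looks at the size and SHA-256 of regular files (no mode or ownership), it keeps the manifest in memory rather than in a checksum file, and the names sha256_of, build_manifest and verify are made up for this example.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to (size, sha256)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            rel = os.path.relpath(full, root)
            manifest[rel] = (os.path.getsize(full), sha256_of(full))
    return manifest


def verify(root, manifest):
    """Return a sorted list of (path, reason); an empty list means clean."""
    current = build_manifest(root)
    problems = []
    for rel, (size, digest) in manifest.items():
        if rel not in current:
            problems.append((rel, "missing"))
        elif current[rel][0] != size:
            problems.append((rel, "size"))
        elif current[rel][1] != digest:
            problems.append((rel, "checksum"))
    for rel in current:
        if rel not in manifest:
            problems.append((rel, "extra"))
    return sorted(problems)
```

As with mtree, the idea is to call build_manifest on the known-good box and verify on the suspect one; saving the manifest to a file between the two steps (with json, say) is left out to keep the sketch short.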

> Do you know what kernel and Xen versions were in SP2/3, and what
> specifically was fixed in the kernel they gave you?

SLES 11 SP2 and SP3 seem to be based on the same 3.0.x kernel tree (SP1, which
was unaffected, was 2.6.32.x). When SP2 was still supported (it has now
dropped out of support) the versions tended to track closely but not exactly.
Xen in SP3 is 4.2.4; SP2 was 4.1.x. By fortuitous timing, the official kernel
update for SLES 11 SP3 was released yesterday; the version with the fix is
3.0.101-0.35.1. The relevant changelog is this:
====
* Thu Jun 05 2014 [email protected]
- swiotlb: don't assume PA 0 is invalid (bnc#865882).
====

Unfortunately that bug is private, even to me, but the git tree is public:
http://kernel.opensuse.org/cgit/kernel-source/commit/?id=0a9fc1a8654e9f62d7a8173fef83c6949ed67e35
http://kernel.opensuse.org/cgit/kernel-source/commit/?h=SLE11-SP3&id=4461f4df6e363235e2ef3b61c41617f7c22dc510

The master aka opensuse-factory branch is on 3.16 (was 3.15 at the time of this
commit), while SLE11-SP3 remains on 3.0.x with backported fixes. This may not
be the bug you're hitting, but if you can find a reproducible test case, that's
half the battle.

-Andrew
_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org