Rahul Nabar wrote:
> On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney<craig.tier...@noaa.gov> wrote:
>> What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1
>> right now on a 448 node Nehalem cluster. I am so far happy with how
>> things work.
>> The original Centos 5.3 kernel, 2.6.18-128.1.10, had bugs in Nehalem
>> support where nodes would just randomly start running slow. Upgrading
>> the kernel fixed that. But that performance problem was either all or
>> none; I don't recall it exhibiting itself in the way that Rahul
>> described.
>
> For me it shows:
>
> Linux version 2.6.18-128.el5 (mockbu...@builder10.centos.org)
>
> I am a bit confused with the numbering scheme now. Is this older or
> newer than Craig's? You are right, Craig, I haven't noticed any random
> slowdowns, but my data is statistically sparse. I only have a single
> Nehalem+CentOS test node right now.
When you run uname -a, don't you get something like:

[ctier...@wfe7 serial]$ uname -a
Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux

We did build our kernel from source, but only because we ripped out the
in-tree IB support so we could build against the latest OFED stack.

Try:

# rpm -qa | grep kernel

and see what version is listed.

We have found a few performance problems so far:

1) Nodes would start going slow, really slow. However, once they
started to go slow they stayed slow, and the problem was only cleared
by a reboot. This was resolved by upgrading to the kernel we use now.

2) Nodes were reporting too many System Events that look like
single-bit errors. This again showed up as nodes that would start to go
slow and wouldn't recover until a reboot. We no longer think we had
lots of bad memory, and the latest BIOS may have fixed it. We are
installing that BIOS now and will start checking.

The only time I was getting variability in timings was when I wasn't
pinning processes and memory correctly. My tests have always used all
the cores in a node, though. I think that OpenMPI does the right thing
with mpi_paffinity_alone. For mvapich, we wrote a wrapper script
(similar to TACC's) that uses numactl directly to pin memory and
threads; a rough sketch is below.

Craig

--
Craig Tierney (craig.tier...@noaa.gov)
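P.S. In case it helps anyone, here is a rough sketch of what such a
numactl wrapper can look like. This is not our exact script: the
local-rank environment variable, the launcher syntax, and the core
layout are assumptions you would have to adapt to your own MPI stack
and nodes (check numactl --hardware for the real core/socket mapping).

#!/bin/bash
#
# pin_wrapper.sh -- minimal sketch of a numactl pinning wrapper.
# Launched as something like:
#   mpirun_rsh -np 8 -hostfile hosts ./pin_wrapper.sh ./my_app
#
# The local-rank variable name depends on the MPI stack (MVAPICH2
# exports MV2_COMM_WORLD_LOCAL_RANK; other launchers differ), so
# adjust it for your site.
LOCAL_RANK=${MV2_COMM_WORLD_LOCAL_RANK:-0}

# Assume a dual-socket Nehalem node with 4 cores per socket and the
# common layout of cores 0-3 on socket 0 and cores 4-7 on socket 1.
CORES_PER_SOCKET=4
CORE=$LOCAL_RANK
SOCKET=$(( CORE / CORES_PER_SOCKET ))

# Bind this rank to one core and to the memory of the matching socket,
# then exec the real application with its arguments.
exec numactl --physcpubind=$CORE --membind=$SOCKET "$@"

With OpenMPI you shouldn't need a wrapper for the fully subscribed
case; passing --mca mpi_paffinity_alone 1 on the mpirun command line
tells OpenMPI to bind the ranks itself.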