Dmitri Chubarov wrote:
Hello,
we have a VX50 down here as well, and we have observed very different
scalability on different applications. With an OpenMP molecular
dynamics code we got over a 14-fold speedup, while on a 2D finite
difference scheme I could not get much beyond 3-fold.
2D finite difference can be communication intensive if the mesh is too small for
each processor to have a fair amount of work to do before needing the
neighboring values from a "far" node.
On Tue, Oct 7, 2008 at 10:45 PM, Eric Thibodeau <[EMAIL PROTECTED]> wrote:
PS: Interesting figures. I couldn't resist compressing the same binary DB on
a 16-core Opteron (Tyan VX50) machine and was dumbfounded to get horrible
results in the same context. The processing speed only came up to 6.4E6
bytes/sec ...for 16 cores, and they were all at 100% during the entire run
(FWIW, I tried different block sizes and it does have an impact, but this
also changes the problem parameters).
Reading your message on the Beowulf list, I'd say it looks
interesting and probably points to something happening with memory
access on the NUMA nodes. Did you try to run the archiver with
different affinity settings?
I don't have affinity control over the app per se; I would have to
look at/modify pbzip's code. Note, though, that the assignment of a PID to
a processor is governed by the kernel and is thus a scheduler issue.
Also note that I have observed the kernel doesn't hesitate to move
processes around the cores.
We have observed that the memory architecture shows some strange
behaviour. For instance, the latency of a read from the same NUMA node
varies significantly depending on which node issues it.
This is the nature of NUMA. Furthermore, if you have to cross to a far
CPU, the latency is also dependent on the CPU's load.
Also, in the profiler I often see that x86 instructions with one
operand in memory may take disproportionately long. I believe that
could explain the 100% CPU load reported by the kernel.
How do you identify the specific instruction using a profiler? This is
something that interests me.
From what little knowledge of this platform we have, I tend to
advise the users not to expect good speedup on their multithreaded
applications.
Using OpenMP (from GCC 4.3.x) and an embarrassingly parallel problem
(computing K-Means on a large database), I do get significant speedup
(15-16x).
Yet it would be interesting to get a better understanding of the
programming techniques for this sedecimus and similar machines.
OpenMP is IMHO the easiest one and will bring you the most performance
out of 3 lines of #pragma directives. If you manage to get a cluster of
VX50s, then learn a bit of MPI to glue it all together ;)
Even more so with QPI systems becoming commercially available very
soon.
Don't know that one (QPI)...oh...new Intel stuff...no matter how much I
try to stay ahead, I'm always years behind!
At the moment we have a few small kernels written in C and Fortran
with OpenMP that we use to evaluate different parallelization
strategies. Unfortunately, I don't know of any tools that could help
explain what's going on inside the memory of this machine.
Of course, check out TAU (
http://www.cs.uoregon.edu/research/tau/home.php ), it will at least help
you identify bottlenecks and give you an impressive profiling
infrastructure.
I am very much interested to hear more about your experience with VX50.
Best regards,
Dima Chubarov
--
Dmitri Chubarov
junior researcher
Siberian Branch of the Russian Academy of Sciences
Institute of Computational Technologies
http://www.ict.nsc.ru/indexen.php
Eric
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf