Dmitri Chubarov wrote:
Hello,
we have a VX50 down here as well, and we have observed very different
scalability on different applications. With an OpenMP molecular
dynamics code we got over a 14-fold speedup, while on a 2D finite
difference scheme I could not get much beyond 3-fold.
2D finite difference can be communication intensive if the mesh is too small for
each processor to have a fair amount of work to do before needing the
neighboring values from a "far" node.
On Tue, Oct 7, 2008 at 10:45 PM, Eric Thibodeau <[EMAIL PROTECTED]> wrote:
PS: Interesting figures. I couldn't resist compressing the same binary DB on
a 16-core Opteron (Tyan VX50) machine and was dumbfounded to get horrible
results in the same context. The processing speed only came up to 6.4E6
bytes/sec ...for 16 cores, and they were all at 100% during the entire run
(FWIW, I tried different block sizes and it does have an impact, but this
also changes the problem parameters).
Reading your message on the Beowulf list, I'd say it looks
interesting and probably points to something happening with memory
access on the NUMA nodes. Did you try to run the archiver with
different affinity settings?
I don't have affinity control over the app per se; I would have to
look at/modify pbzip's code. Note, though, that the assignment of a PID to
a processor is governed by the kernel and is thus a scheduler issue.
Also note that I have observed the kernel doesn't hesitate to move
processes around the cores.
We have observed that the memory architecture shows some strange
behaviour. For instance, the latency of a read from the same NUMA node
varies significantly depending on which node issues it.
This is the nature of NUMA. Furthermore, if you have to cross to a far
CPU, the latency is also dependent on the CPU's load.
Also, in the profiler I often see that x86 instructions with one
operand in memory may take disproportionately long. I believe that
could explain the 100% CPU load reported by the kernel.
How do you identify the specific instruction using a profiler? This is
something that interests me.
From what little knowledge of this platform we have, I tend to
advise the users not to expect good speedup on their multithreaded
applications.
Using OpenMP (from GCC 4.3.x) and an embarrassingly parallel problem
(computing K-Means on a large database), I do get significant speedup
(15-16x).
Yet it would be interesting to get a better understanding of the
programming techniques for this sedecimus and similar machines.
OpenMP is IMHO the easiest one and will bring you the most performance
out of 3 lines of #pragma directives. If you manage to get a cluster of
VX50s, then learn a bit of MPI to glue it all together ;)
Even more so with QPI systems becoming commercially available very
soon.
Don't know that one (QPI)...oh...new Intel stuff...no matter how much I
try to stay ahead, I'm always years behind!
At the moment we have a few small kernels written in C and Fortran
with OpenMP that we use to evaluate different parallelization
strategies. Unfortunately, I don't know of any tools that could help
explain what's going on inside the memory of this machine.
Of course, check out TAU (
http://www.cs.uoregon.edu/research/tau/home.php ), it will at least help
you identify bottlenecks and give you an impressive profiling
infrastructure.
I am very much interested to hear more about your experience with VX50.
Best regards,
Dima Chubarov
--
Dmitri Chubarov
junior researcher
Siberian Branch of the Russian Academy of Sciences
Institute of Computational Technologies
http://www.ict.nsc.ru/indexen.php
Eric
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf