Re: [Beowulf] Options for augmenting cluster vector/data-parallel computing power ...

Dan Kidger Wed, 14 Jun 2006 05:18:39 -0700

Richard Walsh wrote:

All,


Could those of you who have perhaps used or researched the general
purpose use of GPUs (vendors, buyers, builders) to augment the data-
parallel compute power of your clusters add, subtract, and/or comment
on the following summary of the current options in this area?  What have
I failed to realize? What other vendors are out there? How difficult are
the programming environments to use?  What performance gains have you
observed?  Do you forecast Cell-based COTS-like clusters?  Interface
issues with MPI? Etc.

Thanks in advance ...

rbw

GPGPU compute space options micro-summary:

Option 1:

  Purchase high-performance graphics cards (GeForce, Radeon)
  for ~$400, drop them into your PCI-X slot (PCI-e soon to be
  available), learn some Cg programming, and you're ready to get
  10s of additional Gflops per node if you have stream-able kernels.
  You are limited to 32-bit floating-point (and maybe non-IEEE).

  Also limited by the input/output bandwidth asymmetry of the
  graphics cards and their rigid compute pipeline with limited
  conditional capability and programmability.

Option 2:

 Purchase ClearSpeed array-processing cards and software for your
 cluster (much more expensive, how much?) to get ~50 Gflops of additional
 compute power on stream-able kernels. The programming environment is
 presumably better (is it?), and you get full IEEE 64-bit floating point.
 Do you have the same bandwidth asymmetry issues?

Well, I work for ClearSpeed, so I can give some factual information on
our product, but I will refrain from any hard cell (sic).

Each board has 1GB of memory and 2 CPUs. Each CPU has a serial unit and
96 SIMD parallel units (PEs). Each PE has both 64-bit IEEE FP add and
multiply units and, because of VLIW, can issue a fused muladd at a rate
of one every clock tick. The CPU clocks at 250MHz, which keeps the power
consumption down to only 25W per dual-CPU board. So theoretical peak
performance is 96GF per board, but for marketing reasons we quote the
more realistic 50GF that you might expect to see from a real app
(albeit well tuned).
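The 96GF figure follows directly from the numbers quoted above, counting a fused muladd as two floating-point operations per tick. A minimal sketch of that arithmetic in C; the function names are illustrative only and not part of any ClearSpeed software:

```c
#include <assert.h>

/* Theoretical peak in GFLOPS, assuming every PE retires
   flops_per_tick floating-point operations on every clock tick. */
double peak_gflops(int cpus, int pes_per_cpu,
                   int flops_per_tick, double clock_mhz)
{
    double flops_per_cycle = (double)cpus * pes_per_cpu * flops_per_tick;
    return flops_per_cycle * clock_mhz / 1000.0;   /* MHz -> GHz */
}

/* Checks the figures from the post: 2 CPUs, 96 PEs each,
   one muladd (2 flops) per tick, 250MHz clock. */
void check_quoted_peak(void)
{
    assert(peak_gflops(2, 96, 2, 250.0) == 96.0);
}
```

Calling peak_gflops(2, 96, 2, 250.0) gives the board's theoretical peak; dividing by the quoted 25W gives a little under 4 GF per watt at peak.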

If you just do DGEMM or, say, 2D FFTs, then the libraries are already
there: just change your LD_LIBRARY_PATH and it will intercept those
ACML and FFTW calls. If what you do is different, then there is a C
compiler for the board. It is standard C, just with a prefix of 'poly'
before any variable that you want to have 96 instances of rather than
just one, and of course parallel implementations of sin(), sqrt() et al.
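To make the 'poly' idea concrete, here is a rough emulation in plain host-side C, assuming a poly variable simply means one private copy per PE. This is not the board compiler's syntax: the type poly_double and the function poly_muladd are hypothetical names invented for this sketch, and the loop stands in for what the 96 PEs would do in lockstep.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PES 96   /* PEs per CPU, from the figures in the post */

/* HYPOTHETICAL emulation of a 'poly double': one private copy per PE. */
typedef struct {
    double lane[NUM_PES];
} poly_double;

/* y = a * x + y on every lane. On the board each PE would handle its
   own lane in the same clock tick; here a serial loop stands in. */
void poly_muladd(double a, const poly_double *x, poly_double *y)
{
    assert(x != NULL && y != NULL);
    for (int pe = 0; pe < NUM_PES; ++pe)
        y->lane[pe] = a * x->lane[pe] + y->lane[pe];
}
```

The scalar a plays the role of an ordinary (single-instance) variable, while x and y play the role of variables declared with the 'poly' prefix.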

In your host application, you just use the provided API to initialise
one or more boards, load a binary onto each, send data across, launch
your preloaded binary, and then poll to read the results back as they
are generated.
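The sequence described here (initialise, load, send, launch, poll) can be sketched with a mock board in plain C. Everything below is hypothetical: the mock_* names, the mock_board struct and the scale_by_two kernel are invented for illustration and are not the ClearSpeed API, whose real calls are not shown in this post. The mock runs the "kernel" on the host, so it only illustrates the order of the steps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_MEM_DOUBLES 1024   /* pretend on-board memory, in doubles */

/* HYPOTHETICAL stand-in for an accelerator board. */
typedef struct {
    double mem[MOCK_MEM_DOUBLES];
    size_t n;                           /* doubles currently on the board */
    void (*kernel)(double *, size_t);   /* the "binary" loaded onto it */
    int done;                           /* nonzero once results are ready */
} mock_board;

void mock_init(mock_board *b) { *b = (mock_board){0}; }

void mock_load(mock_board *b, void (*kernel)(double *, size_t))
{
    b->kernel = kernel;
}

/* Copies n doubles to the board; returns -1 if they do not fit. */
int mock_send(mock_board *b, const double *src, size_t n)
{
    if (n > MOCK_MEM_DOUBLES) return -1;
    memcpy(b->mem, src, n * sizeof *src);
    b->n = n;
    return 0;
}

/* Runs the loaded kernel; a real board would run asynchronously. */
void mock_launch(mock_board *b)
{
    assert(b->kernel != NULL);
    b->done = 0;
    b->kernel(b->mem, b->n);
    b->done = 1;
}

/* Returns 1 and copies the results out when ready, 0 otherwise. */
int mock_poll(const mock_board *b, double *dst)
{
    if (!b->done) return 0;
    memcpy(dst, b->mem, b->n * sizeof *dst);
    return 1;
}

/* Example kernel: doubles every element in place. */
void scale_by_two(double *v, size_t n)
{
    for (size_t i = 0; i < n; ++i) v[i] *= 2.0;
}

/* The five steps from the post, in order. Returns 0 on success. */
int run_on_board(const double *in, double *out, size_t n)
{
    mock_board b;
    mock_init(&b);
    mock_load(&b, scale_by_two);
    if (mock_send(&b, in, n) != 0) return -1;
    mock_launch(&b);
    while (!mock_poll(&b, out)) {
        /* a real host would do other useful work between polls */
    }
    return 0;
}
```

With a real board the launch would return immediately and the polling loop is where the host overlaps its own work with the accelerator's.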

Reads and writes to the board's memory should be close to whatever your
PCI-X/PCIe chipset can sustain.


Oh yes, and the list price is $8000; yes, a bit more than your GeForce.


Dr. Daniel Kidger, Technical Consultant, Clearspeed plc, Bristol UK
E: [EMAIL PROTECTED]
T: +44 117 317 2030
M: +44 7738 458742
"Write a wise saying and your name will live forever." - Anonymous.








_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
