Richard Walsh wrote:

All,

Could those of you who have perhaps used or researched the general
purpose use of GPUs (vendors, buyers, builders) to augment the data-
parallel compute power of your clusters add, subtract, and/or comment
on the following summary of the current options in this area?  What have
I failed to realize? What other vendors are out there? How difficult are the
programming environments to use?  What performance gains have you
observed?  Do you forecast Cell-based COTS-like clusters?  Interface
issues with MPI? Etc.

Thanks in advance ...

rbw

GPGPU compute space options micro-summary:

Option 1:

  Purchase high-performance graphics cards (GeForce, Radeon)
  for ~$400, drop them into your PCI-X slot (PCI-e soon to be
  available), learn some Cg programming, and you're ready to get
  10s of additional Gflops per node if you have stream-able kernels.
  You are limited to 32-bit floating-point (and maybe non-IEEE).
  Also limited by the input/output bandwidth asymmetry of the
  graphics card and its rigid compute pipeline, with limited conditional
  capability and programmability.

Option 2:

 Purchase ClearSpeed array processing cards and software for your
 cluster (much more expensive, how much?) to get ~50 Gflops of additional
 compute power on stream-able kernels. The programming environment is
 presumably better (is it?), and you get full IEEE 64-bit floating point.
 Do you have the same bandwidth asymmetry issues?


Well, I work for ClearSpeed, so I can give some factual information on our product, but I will refrain from any hard cell (sic).

Each board has 1GB of memory and 2 CPUs. Each CPU has a serial unit and 96 SIMD parallel units (PEs). Each PE has both 64-bit IEEE FP add and multiply units and, because of VLIW, can issue a fused multiply-add at a rate of one every clock tick. The CPU clocks at 250MHz - this keeps the power consumption down to only 25W per dual-CPU board. So theoretical peak performance is 96GF per board, but for marketing reasons we quote the more realistic 50GF that you might expect to see from a real (albeit well-tuned) app.
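For anyone who wants to check the arithmetic behind the 96GF figure, here is a small sketch in Python. It uses only the numbers quoted above and assumes a multiply-add counts as two floating-point operations; the variable names are mine, for illustration only.

```python
# Figures taken from the message above.
cpus_per_board = 2
pes_per_cpu = 96
flops_per_pe_per_cycle = 2   # assumption: one multiply plus one add per clock tick
clock_hz = 250e6             # 250MHz

# Theoretical peak: every PE issues a multiply-add on every cycle.
peak_gflops = cpus_per_board * pes_per_cpu * flops_per_pe_per_cycle * clock_hz / 1e9

quoted_gflops = 50.0         # the "more realistic" figure
board_watts = 25.0           # power per dual-CPU board

print(f"theoretical peak: {peak_gflops:.0f} GF per board")
print(f"quoted / peak:    {quoted_gflops / peak_gflops:.0%}")
print(f"quoted GF per W:  {quoted_gflops / board_watts:.1f}")
```

So the quoted 50GF is a little over half of the theoretical peak, and works out at about 2GF per watt.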

If you just do DGEMM or, say, 2D FFTs, then the libraries are already there - just change your LD_LIBRARY_PATH and it will intercept those ACML and FFTW calls. If what you do is different, then there is a C compiler for the board. Standard C - just with a prefix of 'poly' before any variable that you want to have 96 instances rather than just one, and of course parallel implementations of sin(), sqrt() et al.
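To make the 'poly' idea concrete, here is a rough conceptual model in Python. It is not the board's C dialect and not its library; the helper names `poly` and `poly_map` are hypothetical. It only illustrates the semantics described above: an ordinary variable has a single instance, a 'poly' variable has one instance per PE, and a math function applied to it runs on all 96 instances in lockstep.

```python
import math

NUM_PES = 96  # PEs per CPU, as stated in the message


def poly(init):
    """Hypothetical helper: model a 'poly' variable as one value per PE."""
    return [init(pe) for pe in range(NUM_PES)]


def poly_map(fn, *operands):
    """Hypothetical helper: apply the same operation on every PE (SIMD style)."""
    return [fn(*values) for values in zip(*operands)]


scale = 0.5                         # ordinary variable: a single instance
x = poly(lambda pe: float(pe))      # 'poly' variable: 96 instances, one per PE

# Every PE computes sqrt(scale * x) on its own copy of x.
y = poly_map(lambda xi: math.sqrt(scale * xi), x)

print(len(y), y[8])
```

On the real hardware the 96 instances would live in the PEs' own storage and execute simultaneously; the list comprehension here just stands in for that.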

In your host application - you just use the provided API to initialise one or more boards, load a binary onto it, send data across and launch your preloaded binary, then poll to read the results back as they are generated.
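The host-side sequence above (initialise, load, send, launch, poll) can be sketched as follows. This is a mock written in Python purely to show the control flow; `MockBoard` and its methods are hypothetical and are not the actual ClearSpeed API, and the "binary" is just a Python function.

```python
from collections import deque


class MockBoard:
    """Hypothetical stand-in for an accelerator board (not the vendor API)."""

    def __init__(self):
        self.kernel = None
        self.memory = []
        self.results = deque()

    def load(self, kernel):
        # Stands in for loading a compiled binary onto the board.
        self.kernel = kernel

    def write(self, data):
        # Stands in for copying input data into board memory.
        self.memory = list(data)

    def launch(self):
        # A real board would run asynchronously; the mock queues results at once.
        for value in self.memory:
            self.results.append(self.kernel(value))

    def poll(self):
        # Return the next available result, or None if nothing is ready yet.
        return self.results.popleft() if self.results else None


def run_on_board(kernel, data):
    board = MockBoard()           # 1. initialise a board
    board.load(kernel)            # 2. load the binary
    board.write(data)             # 3. send data across
    board.launch()                # 4. launch the preloaded binary
    out = []
    while len(out) < len(data):   # 5. poll for results as they are generated
        result = board.poll()
        if result is not None:    # assumes the kernel never returns None
            out.append(result)
    return out


squares = run_on_board(lambda v: v * v, [1.0, 2.0, 3.0])
print(squares)
```

The polling loop is the part worth noting: the host does not block on one big result, it reads results back as they become available.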

Reads and writes to the board's memory should be close to whatever your PCI-X/PCIe chipset can sustain.

Oh yes, and the list price is $8000 - yes, a bit more than your GeForce.


Dr. Daniel Kidger, Technical Consultant, Clearspeed plc, Bristol UK
E: [EMAIL PROTECTED]
T: +44 117 317 2030
M: +44 7738 458742
"Write a wise saying and your name will live forever." - Anonymous.








_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
