Richard Walsh wrote:
All,
Could those of you who have perhaps used or researched the general
purpose use of GPUs (vendors, buyers, builders) to augment the
data-parallel compute power of your clusters add, subtract, and/or comment
on the following summary of the current options in this area? What have
I failed to realize? What other vendors are out there? How difficult
are the programming environments to use? What performance gains have you
observed? Do you forecast Cell-based COTS-like clusters? Interface
issues with MPI? Etc.
Thanks in advance ...
rbw
GPGPU compute space options micro-summary:
Option 1:
Purchase high-performance graphics cards (GeForce, Radeon)
for ~$400, drop them into your PCI-X slot (PCI-e soon to be
available), learn some Cg programming, and you're ready to get
10s of additional Gflops per node if you have stream-able kernels.
You are limited to 32-bit floating-point (and maybe non-IEEE).
Also limited by the input/output bandwidth asymmetry of the
graphics cards and their rigid compute pipeline, with limited
conditional capability and programmability.
Option 2:
Purchase ClearSpeed Array processing cards and software for your
cluster (much more expensive, how much?) to get ~50 Gflops of additional
compute power on stream-able kernels. The programming environment is
presumably better (is it?), and you get full IEEE 64-bit floating point.
Do you have the same bandwidth asymmetry issues?
Well, I work for ClearSpeed, so I can give some factual information on
our product, but I will refrain from any hard cell (sic).
Each board has 1GB of memory and 2 CPUs. Each CPU has a serial unit and
96 SIMD parallel units (PEs).
Each PE has both 64-bit IEEE FP add and multiply units and, because of
VLIW, can issue a fused muladd at a rate of one every clock tick.
The CPU clocks at 250MHz - this keeps the power consumption down to only
25W per dual-cpu board.
So theoretical peak performance is 96GF per board, but for marketing
reasons we quote the more realistic 50GF that you might expect to see
from a real (albeit well-tuned) app.
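
For anyone who wants to check the arithmetic, here is a small Python sketch of how the 96GF peak falls out of the figures above. It assumes the usual convention that a muladd counts as two floating-point operations (one multiply, one add); the variable and function names are made up for illustration.

```python
# Figures quoted in the message above.
CPUS_PER_BOARD = 2
PES_PER_CPU = 96
CLOCK_HZ = 250e6
BOARD_WATTS = 25.0
QUOTED_GFLOPS = 50.0

# Assumption: one muladd per clock tick is counted as 2 flops (multiply + add).
FLOPS_PER_MULADD = 2


def peak_gflops(cpus, pes_per_cpu, flops_per_cycle, clock_hz):
    """Theoretical peak in GFLOPS if every PE issues one muladd per cycle."""
    return cpus * pes_per_cpu * flops_per_cycle * clock_hz / 1e9


peak = peak_gflops(CPUS_PER_BOARD, PES_PER_CPU, FLOPS_PER_MULADD, CLOCK_HZ)
print(f"theoretical peak: {peak:.0f} GF per board")
print(f"quoted figure as a fraction of peak: {QUOTED_GFLOPS / peak:.0%}")
print(f"peak per watt: {peak / BOARD_WATTS:.2f} GF/W")
```

So the quoted 50GF is a little over half of the theoretical peak.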
If you just do DGEMM or, say, 2D FFTs, then the libraries are already
there - just change your LD_LIBRARY_PATH and they will intercept those
ACML and FFTW calls. If what you do is different, then there is a C
compiler for the board. It is standard C, just with a prefix of 'poly' before
any variable that you want to have 96 instances rather than just one,
and of course parallel implementations of sin(), sqrt() et al.
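
To give a feel for the 'poly' idea, here is a rough emulation in plain Python. It is not the board's C dialect, only a sketch of the semantics described above: an ordinary variable has a single instance, a 'poly' variable has one instance per PE, and one statement acts on all 96 copies. All names are invented for the example.

```python
import math

N_PES = 96  # PEs per CPU, from the message above

# Ordinary variable: a single instance.
scale = 0.5

# 'poly'-like variable: one private value per PE (emulated with a list).
angle = [pe * math.pi / N_PES for pe in range(N_PES)]

# One source-level statement is applied to every PE's copy. On the board
# this would happen in lock step; here a list comprehension stands in for it.
result = [scale * math.sin(a) for a in angle]

print(len(result), "values computed, one per PE")
```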
In your host application - you just use the provided API to initialise
one or more boards, load a binary onto it, send data across and launch
your preloaded binary, then poll to read the results back as they are
generated.
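
The host-side flow above (initialise, load, send, launch, poll) can be sketched roughly as follows. This is a pure-Python mock written for illustration only: MockBoard and run_on_boards are hypothetical names, not the ClearSpeed API, and the "board" here computes its results immediately rather than asynchronously.

```python
from collections import deque


class MockBoard:
    """Hypothetical stand-in for an accelerator board (NOT a real API)."""

    def __init__(self, board_id):
        self.board_id = board_id
        self.kernel = None
        self.inbox = []
        self.results = deque()

    def load(self, kernel):
        # Stands in for loading a compiled binary onto the board.
        self.kernel = kernel

    def send(self, data):
        # Stands in for copying input data into the board's memory.
        self.inbox = list(data)

    def launch(self):
        # A real board would run asynchronously; the mock queues results at once.
        for x in self.inbox:
            self.results.append(self.kernel(x))

    def poll(self):
        # Return (True, value) if a result is ready, else (False, None).
        if self.results:
            return True, self.results.popleft()
        return False, None


def run_on_boards(kernel, data, n_boards=2):
    """Initialise boards, load the kernel, send data, launch, poll for results."""
    boards = [MockBoard(i) for i in range(n_boards)]
    for i, board in enumerate(boards):
        board.load(kernel)
        board.send(data[i::n_boards])  # round-robin split of the input
        board.launch()
    out = []
    for board in boards:
        ready, value = board.poll()
        while ready:
            out.append(value)
            ready, value = board.poll()
    return out


squares = run_on_boards(lambda x: x * x, [1.0, 2.0, 3.0, 4.0])
print(sorted(squares))
```

Results come back grouped by board rather than in input order, so a real application would need to track which chunk went where.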
Reads and writes to the board's memory should be close to whatever your
PCI-X/PCIe chipset can sustain.
Oh yes, and the list price is $8000 - yes, a bit more than your GeForce.
Dr. Daniel Kidger, Technical Consultant, Clearspeed plc, Bristol UK
E: [EMAIL PROTECTED]
T: +44 117 317 2030
M: +44 7738 458742
"Write a wise saying and your name will live forever." - Anonymous.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf