----- Original Message ----- From: "Daniel Pfenniger" <[EMAIL PROTECTED]>
To: "Jim Lux" <[EMAIL PROTECTED]>
Cc: <beowulf@beowulf.org>
Sent: Thursday, March 16, 2006 6:32 PM
Subject: Re: [Beowulf] Vector coprocessors




Jim Lux wrote:
...
There are probably applications where a dedicated card can blow the doors off a collection of PCs. At some point, the interprocessor communication latency inherent in any sort of cabling between processors would start to dominate.

As usual it depends on the application. Vector computations
are not universal, even if they are frequent in technical problems.
Even in the favorable cases it is not rare to have, say, over 10% serial
code that does not benefit from the card. In the end the card, despite its
192 procs, may accelerate typical applications by only a factor of a few.
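
To put a number on that: by Amdahl's law, the best possible speedup with
serial fraction s on p processors is 1/(s + (1-s)/p). With s = 0.1 and
p = 192 that is 1/(0.1 + 0.9/192), or about 9.6, so a factor of ten at
the very best no matter how many processors the card carries.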

If Clearspeed would consider mass production at a cost like $100-$500

If you produce such cards in low quantity, you lose roughly $100 per PCI card to royalties alone; then add the chip production price. Two big chips; I do not know what price they are, but it sounds expensive to me. I once discussed one big chip for some other card.

That chip had a price, when mass produced, of $50 a chip.

So the bare production price of this card I estimate at around $250 (roughly $100 in royalties plus 2 x $50 for the chips plus the board itself). You don't want to lose big on such a card, of course.

That means an importer price of $500 and a consumer price of at least $1000.

Of course, with this type of card you skip the importer.

According to my economics book, a company can then follow two approaches. You can try to flood the market and sell 50 million of them, which means the card will be priced at $1000.

Or you can be realistic and accept that even a lower price for the card will not increase sales by more than a factor of 2.

In short, the highest price you can reasonably ask is the most interesting, because there are plenty of universities that want a few cards to toy with. They will pay $8000+, and that is really a minimum for those customers, because once they start toying they will ship you 100 questions, after which the card gathers dust and sits unused.

If you're serious and you want to buy 200 of their cards, then you're a big customer. Propose a confidential deal: you don't publicly reveal the price paid, and you sign that for the first 3 years you won't resell their cards, nor lend or rent them to anyone else. Under that condition you offer $200k for the 200 cards.

After some give and take you end up paying $2000 a card.

Not bad: a 10 TFLOP double-precision cluster for roughly $500k in that case (200 cards at $2000 is $400k; add host nodes and a network and you land near $500k).

Note that if you build a cluster from such cards, the bandwidth of your PCI-X bus will be a limitation to the other nodes anyway. The only difference is that it is then a limitation from every point to every point, so from the programmer's viewpoint it is in fact a more symmetric programming model. Nothing new there.

per card, the market would be huge, because the card would be competing with multi-core processors like the IBM-Sony Cell.

You need "really big" volumes to get there.

Such cards aren't competing with the Cell at all.

You are only competing if both products can be bought in a store on the same day.

Yes, but it does not seem unreasonable to me to put such a card in
millions of PCs if the average application runs a bit faster and the
cost increase stays below the cost of the PC. After all, the 8087 math
coprocessor (and its successors up to the i386-era 80387) did just that.

The average user cares about his game running faster, not about some double-precision floating-point monster. Most 3D graphics operations I'd qualify as single precision, not double precision.

....

I would say that there is more potential for a clever soul to reprogram the guts of Matlab, etc., to transparently share the work across multiple machines. I think that's in the back of the mind of MS as they move toward a services environment and .NET.

Lots of people have thought about that for a long time, including
Cleve Moler. The potential clever soul would have to be well above
average, and, considering MS products, well above the average MS programmer.

An intriguing way to parallelize C with threads on multicore processors is
provided by Cilk (http://supertech.lcs.mit.edu/cilk/).  Cilk consists of
a couple of simple extensions to the C language.
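
For a flavor of the syntax, here is the classic Fibonacci example, roughly as it appears in the Cilk-5 documentation (quoted from memory, so treat it as a sketch): a parallel function is marked cilk, children are forked with spawn, and sync waits for them.

    #include <stdio.h>
    #include <stdlib.h>

    /* Built with MIT Cilk's cilkc compiler; cilk, spawn and sync
       are the language extensions, everything else is plain C. */
    cilk int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);   /* child may run on another worker */
        y = spawn fib(n - 2);
        sync;                   /* wait for both children */
        return x + y;
    }

    cilk int main(int argc, char *argv[])
    {
        int n, result;
        n = (argc > 1) ? atoi(argv[1]) : 30;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }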

If anyone has experience with Cilk it would be nice to share.

RAISES A BIG HAND.

Dan

I guess you googled around a bit and found the prehistoric parallel programming language Cilk.
You can indeed use it inside C code.

Yes, I know how to use it. I also know how it performed in chess programs.

In the famous Cilk chess program: Cilkchess from MIT, Leiserson & co. It was actually programmed by Don Dailey, a very nice guy.

On a single cpu his program got around 180k nps (nodes per second).
Compiled with Cilk it achieved around 5000 nodes a second.

That was on a single cpu.

When running on a 512-processor machine it dropped even further in performance.

I remember playing against them, and their scaling was pretty bad.

They basically claimed good scaling because they assumed 1 cpu = 5000 nodes.
However, I calculated their scaling against 1 cpu = 180k nodes, because without Cilk it gets 180k nps.

This scientific way of looking good on paper is very well known:
first slow down your parallel program a few dozen times, in this case 20-50 times, in order to
show better scaling.

MIT's scaling as calculated by me was around 2%. (Even with a perfect 512-way speedup from the 5000 nps baseline, efficiency against the native code could never exceed 5000/180000, about 2.8%.)

On the other hand, I parallelized my chess program myself instead of using Cilk.
That was of course a lot harder, and it took me 1.5 years of hard programming,
but it scales very well.

It scales at 50+% efficiency on 512 processors.

Actually, on paper I could probably claim 100+%: a single cpu could search 20k nps, and at 460 cpus I reached a peak of around 9.99 million nps, which is about 21.7k nps per cpu, more than a single cpu manages alone. The reason for that is that I used a global transposition table.

AFAIK Cilkchess wasn't using one, which makes their achievement even more pathetic.
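
For readers who don't know the concept: a global transposition table is one big hash table of already-searched positions that every processor probes and updates, so no processor repeats work another has finished. A minimal sketch of one well-known lockless way to share such a table (the XOR validation trick; not necessarily what either program actually used) looks like this in C:

    #include <stdint.h>

    typedef struct {
        uint64_t check;   /* zobrist key XOR data, detects torn writes */
        uint64_t data;    /* packed score, depth, bound type, best move */
    } tt_entry;

    #define TT_SIZE (1UL << 20)        /* must be a power of two */
    static tt_entry tt[TT_SIZE];

    void tt_store(uint64_t zobrist, uint64_t data)
    {
        tt_entry *e = &tt[zobrist & (TT_SIZE - 1)];
        e->check = zobrist ^ data;     /* no lock taken */
        e->data  = data;
    }

    int tt_probe(uint64_t zobrist, uint64_t *data)
    {
        tt_entry *e = &tt[zobrist & (TT_SIZE - 1)];
        uint64_t d = e->data;
        if ((e->check ^ d) == zobrist) { *data = d; return 1; }
        return 0;  /* miss, or entry clobbered by a concurrent write */
    }

Every processor sees every other processor's finished work, which is where superlinear-looking numbers can come from.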

Yet you can look at it from another viewpoint.

If you want to parallelize something, the machine isn't yours anyway, and you want a quick result,
why not allocate 10000 processors and run it with Cilk?

You know, if you lose a factor of 100 or so, who cares; you're still 100 times faster than your PC!

That said, there will probably be cases of programs that scale better with Cilk, if they are really embarrassingly parallel.

But you had better have a good high-end network with Cilk anyway.

Cilk is like programming in BASIC.

BASIC is easy for beginners and you can get a job done quickly; if the problem is embarrassingly parallel, you might not even suffer too much of a performance penalty, and you didn't lose any of your own time.

If you want something that performs better, then consider MPI.

If you want to be faster than MPI, then parallelize without MPI within one shared-memory node, and parallelize
between the nodes with MPI. It's more effort than Cilk, for sure.
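
A minimal sketch of that hybrid layout (a toy reduction rather than a chess search, and assuming OpenMP as one possible choice for the in-node threads):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nprocs;
        /* FUNNELED: only the main thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long N = 100000000L;
        long lo = rank * (N / nprocs);
        long hi = (rank == nprocs - 1) ? N : (rank + 1) * (N / nprocs);

        double local = 0.0, total = 0.0;
        /* shared-memory parallelism inside the node; no MPI in here */
        #pragma omp parallel for reduction(+:local)
        for (long i = lo; i < hi; i++)
            local += 1.0 / (double)(i + 1);

        /* one message per node instead of one per thread */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("harmonic sum H(%ld) = %f\n", N, total);

        MPI_Finalize();
        return 0;
    }

The design point is that only one MPI message leaves each node per step, instead of one per thread.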

But don't use Cilk to get the 'maximum' performance out of a machine.
That's wishful thinking.

Vincent

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

