----- Original Message ----- From: "Daniel Pfenniger" <[EMAIL PROTECTED]>
To: "Jim Lux" <[EMAIL PROTECTED]>
Cc: <beowulf@beowulf.org>
Sent: Thursday, March 16, 2006 6:32 PM
Subject: Re: [Beowulf] Vector coprocessors




Jim Lux wrote:
...
There are probably applications where a dedicated card can blow the doors off a collection of PCs. At some point, the interprocessor communication latency inherent in any sort of cabling between processors would start to dominate.

As usual it depends on the application. Vector computations
are not universal, even if they are frequent in technical problems.
Even in the favorable cases it is not rare to have, say, over 10% serial
code that does not benefit from the card. In the end the card, despite its
192 procs, may accelerate typical applications by only a factor of a few.
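
To put a number on that: by Amdahl's law, the best possible speedup with
serial fraction s on p processors is 1/(s + (1-s)/p). With s = 0.1 and
p = 192 that is 1/(0.1 + 0.9/192), or about 9.6, so a factor of ten at
the very best no matter how many processors the card carries.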

If Clearspeed would consider mass production at a cost like $100-$500

If you produce such cards in low quantity, you lose roughly $100 per PCI card to royalties alone; then add the chip production price. Two big chips; I do not know what price they are, but it sounds expensive to me. I once discussed one big chip for some other card.

That chip had a price, when mass produced, of $50 a chip.

So the bare production price of this card I estimate at around $250 (roughly $100 in royalties plus 2 x $50 for the chips plus the board itself). You don't want to lose big on such a card, of course.

That means an importer price of $500 and a consumer price of at least $1000.

Of course, with this type of card you skip the importer.

According to my economics book, a company can then follow two approaches. You can try to flood the market and sell 50 million of them, which means the card will be priced at $1000.

Or you can be realistic and accept that even a lower price for the card will not increase sales by more than a factor of 2.

In short, the highest price you can reasonably ask is the most interesting, because there are plenty of universities that want a few cards to toy with. They will pay $8000+, and that is really a minimum for those customers, because once they start toying they will ship you 100 questions, after which the card gathers dust and sits unused.

If you're serious and you want to buy 200 of their cards, then you're a big customer. Propose a confidential deal: you don't publicly reveal the price paid, and you sign that for the first 3 years you won't resell their cards, nor lend or rent them to anyone else. Under that condition you offer $200k for the 200 cards.

After some give and take you end up paying $2000 a card.

Not bad: a 10 TFLOP double-precision cluster for roughly $500k in that case (200 cards at $2000 is $400k; add host nodes and a network and you land near $500k).

Note that if you build a cluster from such cards, the bandwidth of your PCI-X bus will be a limitation to the other nodes anyway. The only difference is that it is then a limitation from every point to every point, so from the programmer's viewpoint it is in fact a more symmetric programming model. Nothing new there.

per card, the market would be huge, because the card would be competing with multi-core processors like the IBM-Sony Cell.

You need "really big" volumes to get there.

Such cards aren't competing with the Cell at all.

You are only competing if both products can be bought in a store on the same day.

Yes, but it does not seem unreasonable to me to put such a card in
millions of PCs if the average application runs a bit faster and the
cost increase stays below the cost of the PC. After all, the 8087 math
coprocessor (and its successors up to the i386-era 80387) did just that.

The average user cares about his game running faster, not about some double-precision floating-point monster. Most 3D graphics operations I'd qualify as single precision, not double precision.

....

I would say that there is more potential for a clever soul to reprogram the guts of Matlab, etc., to transparently share the work across multiple machines. I think that's in the back of the mind of MS as they move toward a services environment and .NET.

Lots of people have thought about that for a long time, including
Cleve Moler. The potential clever soul would have to be well above
average, and, considering MS products, well above the average MS programmer.

An intriguing way to parallelize C with threads on multicore processors is
provided by Cilk (http://supertech.lcs.mit.edu/cilk/).  Cilk consists of
a couple of simple extensions to the C language.
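
For a flavor of the syntax, here is the classic Fibonacci example, roughly as it appears in the Cilk-5 documentation (quoted from memory, so treat it as a sketch): a parallel function is marked cilk, children are forked with spawn, and sync waits for them.

    #include <stdio.h>
    #include <stdlib.h>

    /* Built with MIT Cilk's cilkc compiler; cilk, spawn and sync
       are the language extensions, everything else is plain C. */
    cilk int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);   /* child may run on another worker */
        y = spawn fib(n - 2);
        sync;                   /* wait for both children */
        return x + y;
    }

    cilk int main(int argc, char *argv[])
    {
        int n, result;
        n = (argc > 1) ? atoi(argv[1]) : 30;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }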

If anyone has experience with Cilk it would be nice to share.

RAISES A BIG HAND.

Dan

I guess you googled around a bit and found the prehistoric parallel programming language Cilk.
You can indeed use it inside C code.

Yes, I know how to use it. I also know how it performed in chess programs.

In the famous Cilk chess program: Cilkchess from MIT, Leiserson & co. It was actually programmed by Don Dailey, a very nice guy.

On a single cpu his program got around 180k nps (nodes per second).
Compiled with Cilk it achieved around 5000 nodes a second.

That was on a single cpu.

When running on a 512-processor machine it dropped even further in performance.

I remember playing against them, and their scaling was pretty bad.

They basically claimed good scaling because they assumed 1 cpu = 5000 nodes.
However, I calculated their scaling against 1 cpu = 180k nodes, because without Cilk it gets 180k nps.

This scientific way of looking good on paper is very well known:
first slow down your parallel program a few dozen times, in this case 20-50 times, in order to
show better scaling.

MIT's scaling as calculated by me was around 2%. (Even with a perfect 512-way speedup from the 5000 nps baseline, efficiency against the native code could never exceed 5000/180000, about 2.8%.)

On the other hand, I parallelized my chess program myself instead of using Cilk.
That was of course a lot harder, and it took me 1.5 years of hard programming,
but it scales very well.

It scales at 50+% efficiency on 512 processors.

Actually, on paper I could probably claim 100+%: a single cpu could search 20k nps, and at 460 cpus I reached a peak of around 9.99 million nps, which is about 21.7k nps per cpu, more than a single cpu manages alone. The reason for that is that I used a global transposition table.

AFAIK Cilkchess wasn't using one, which makes their achievement even more pathetic.
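
For readers who don't know the concept: a global transposition table is one big hash table of already-searched positions that every processor probes and updates, so no processor repeats work another has finished. A minimal sketch of one well-known lockless way to share such a table (the XOR validation trick; not necessarily what either program actually used) looks like this in C:

    #include <stdint.h>

    typedef struct {
        uint64_t check;   /* zobrist key XOR data, detects torn writes */
        uint64_t data;    /* packed score, depth, bound type, best move */
    } tt_entry;

    #define TT_SIZE (1UL << 20)        /* must be a power of two */
    static tt_entry tt[TT_SIZE];

    void tt_store(uint64_t zobrist, uint64_t data)
    {
        tt_entry *e = &tt[zobrist & (TT_SIZE - 1)];
        e->check = zobrist ^ data;     /* no lock taken */
        e->data  = data;
    }

    int tt_probe(uint64_t zobrist, uint64_t *data)
    {
        tt_entry *e = &tt[zobrist & (TT_SIZE - 1)];
        uint64_t d = e->data;
        if ((e->check ^ d) == zobrist) { *data = d; return 1; }
        return 0;  /* miss, or entry clobbered by a concurrent write */
    }

Every processor sees every other processor's finished work, which is where superlinear-looking numbers can come from.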

Yet you can look at it from another viewpoint.

If you want to parallelize something, the machine isn't yours anyway, and you want a quick result,
why not allocate 10000 processors and run it with Cilk?

You know, if you lose a factor of 100 or so, who cares; you're still 100 times faster than your PC!

That said, there will probably be cases of programs that scale better with Cilk, if they are really embarrassingly parallel.

But you had better have a good high-end network with Cilk anyway.

Cilk is like programming in BASIC.

BASIC is easy for beginners and you can get a job done quickly; if the problem is embarrassingly parallel, you might not even suffer too much of a performance penalty, and you didn't lose any of your own time.

If you want something that performs better, then consider MPI.

If you want to be faster than MPI, then parallelize without MPI within one shared-memory node, and parallelize
between the nodes with MPI. It's more effort than Cilk, for sure.
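
A minimal sketch of that hybrid layout (a toy reduction rather than a chess search, and assuming OpenMP as one possible choice for the in-node threads):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nprocs;
        /* FUNNELED: only the main thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long N = 100000000L;
        long lo = rank * (N / nprocs);
        long hi = (rank == nprocs - 1) ? N : (rank + 1) * (N / nprocs);

        double local = 0.0, total = 0.0;
        /* shared-memory parallelism inside the node; no MPI in here */
        #pragma omp parallel for reduction(+:local)
        for (long i = lo; i < hi; i++)
            local += 1.0 / (double)(i + 1);

        /* one message per node instead of one per thread */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("harmonic sum H(%ld) = %f\n", N, total);

        MPI_Finalize();
        return 0;
    }

The design point is that only one MPI message leaves each node per step, instead of one per thread.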

But don't use Cilk to get the 'maximum' performance out of a machine.
That's wishful thinking.

Vincent

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

