With that many nodes I'd go for either InfiniBand or Quadrics, assuming the
largest partition also spans all 512 nodes.
Both scale much better at that node count. Your software will need a lot of
communication, since you'll probably need quite a lot of RAM for the
application on every node.
Of course most vendors will want to sell you Myrinet as it's simply cheaper;
they might earn more on it per node.
For this type of code, the network you use and the total amount of RAM are
the 2 most important choices.
You could consider putting 2 network cards in each node, assuming each node
is quite big, in order to dedicate the high-end network entirely to the
memory communication.
As I/O already has quite a high latency, the I/O network can be a
high-bandwidth network with poor latency, while the memory communication
gets a real state-of-the-art high-end network.
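
To illustrate why latency matters far less for I/O than for memory-style
traffic, here is a rough sketch using the simple cost model
time = latency + size / bandwidth. All figures (latencies, bandwidth,
message sizes) are made-up assumptions for illustration, not measurements of
any real interconnect.

```python
# Simple cost model: time = latency + size / bandwidth.
# Every number below is an illustrative assumption, not a measurement.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time in seconds to move one message of size_bytes over a link."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical networks: (latency in seconds, bandwidth in bytes/second)
LOW_LATENCY_NET = (3e-6, 1.0e9)    # assumed: 3 microseconds, 1 GB/s
HIGH_LATENCY_NET = (50e-6, 1.0e9)  # assumed: 50 microseconds, same bandwidth

SMALL_MESSAGE = 1024               # assumed memory-style message: 1 KiB
LARGE_IO_BLOCK = 64 * 1024 * 1024  # assumed I/O transfer: 64 MiB

def slowdown(size_bytes):
    """How many times slower the high-latency network is for this size."""
    fast = transfer_time(size_bytes, *LOW_LATENCY_NET)
    slow = transfer_time(size_bytes, *HIGH_LATENCY_NET)
    return slow / fast

if __name__ == "__main__":
    print(f"small message slowdown:  {slowdown(SMALL_MESSAGE):.2f}x")
    print(f"large I/O block slowdown: {slowdown(LARGE_IO_BLOCK):.4f}x")
```

Under these assumptions the small message is more than ten times slower on
the high-latency network, while the large I/O block is barely affected.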
The problems you can expect depend largely on the number of users who are
going to use your cluster simultaneously.
More users = more problems.
Just avoid all the commercial software for putting nodes to work that most
manufacturers try to sell you.
My experience is that PDSH works pretty well for starting work.
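
As a minimal sketch of what starting work with pdsh can look like from a
script: the helper below only builds the command line. The host naming
scheme `node[001-512]` and the helper names are assumptions for
illustration; `-w` is pdsh's option for giving the target host list.

```python
import shlex
import subprocess
from typing import List

def build_pdsh_command(hostlist: str, remote_command: str) -> List[str]:
    """Build a pdsh invocation that runs remote_command on every host.

    hostlist is a pdsh host list such as "node[001-512]" (hypothetical
    host names; adapt to your own naming scheme).
    """
    return ["pdsh", "-w", hostlist, remote_command]

def run_on_nodes(hostlist: str, remote_command: str) -> subprocess.CompletedProcess:
    """Run the command through pdsh; requires pdsh to be installed."""
    cmd = build_pdsh_command(hostlist, remote_command)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)

if __name__ == "__main__":
    cmd = build_pdsh_command("node[001-512]", "uptime")
    # Only print the command here; call run_on_nodes() on a real cluster.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```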
Does your software handle dying nodes, and can the network hot-swap them?
If not, just consider the odds that a node will sometimes need maintenance.
How do you want to divide the cluster: into 1 partition of 512 nodes, or do
you plan all kinds of small partitions?
A network is of course more expensive when you have 1 huge cluster than when
you divide it into small partitions.
If a node dies, then with several small partitions your other partitions
keep running without problems. Only the partition with
the dying node has a problem.
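
A rough way to put numbers on this, assuming nodes fail independently with
the same probability per day. The failure rate of 0.1% per node per day and
the 8 x 64 layout are purely assumed for illustration; plug in your own
figures.

```python
def p_partition_hit(nodes_in_partition, p_node_fail):
    """Probability that at least one node in the partition fails,
    assuming independent failures with equal probability per node."""
    return 1.0 - (1.0 - p_node_fail) ** nodes_in_partition

P_NODE_FAIL = 0.001   # assumed: 0.1% chance per node per day
TOTAL_NODES = 512

if __name__ == "__main__":
    for partitions in (1, 8):  # 1 x 512 nodes versus 8 x 64 nodes
        size = TOTAL_NODES // partitions
        p = p_partition_hit(size, P_NODE_FAIL)
        print(f"{partitions} partition(s) of {size} nodes: "
              f"chance a given partition is hit per day = {p:.3f}")
```

With one big partition every failure stops the whole machine; with smaller
partitions each failure only touches the partition that contains the node.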
Most likely that dying node just has some dust inside its PSU :)
Vincent
----- Original Message -----
From: "Walid" <[EMAIL PROTECTED]>
To: <beowulf@beowulf.org>
Sent: Wednesday, April 26, 2006 11:34 AM
Subject: [Beowulf] 512 nodes Myrinet cluster Challanges
Hi all,
Does anyone know what types of problems/challenges to expect with big
clusters? We are considering having a 512 node cluster that will be using
Myrinet as its main interconnect, and would like to do our homework.
The cluster is meant to run an in-house fluid simulation application
that is I/O intensive, and requires large memory models.
Any hints or pointers will be appreciated.
TIA
Walid.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf