This will boil down to a question eventually, but I need to give some background first. We are a small group doing CFD, and when we realized several years ago that Beowulf clusters were the right choice for us, we decided to extend our computational capabilities gradually. Every year or two we have bought two gigabit switches and a bunch of nodes connected to them: one switch is used for MPI communication and the other for connecting the nodes to a fileserver and a master node.
As of today we have five "subclusters", all connected to the same fileserver and master node (Torque/Maui is used to distribute jobs across the different subclusters). This has worked out great for us, and we believe the strategy of buying gradually has been advantageous (compared to doing larger purchases less often), so we want to continue extending our hardware in this fashion.

Up till now we have not been hurt by the fact that we have a single fileserver (connected to a bunch of RAIDed drives), but we anticipate there will be issues when we further extend the number of nodes. We therefore plan on building a separate "InfiniBand storage network" (built around a 24-port DDR switch) and connecting a number of "Gluster nodes" to it. Each subcluster will then be connected to this storage network via one (or maybe several) ports. However, we will still limit jobs to run within their separate subclusters, and we are willing to accept lower bandwidth between the subclusters. By doing this we gain the following:

(i) We can get more computational nodes, since we are limiting the number of ports used to connect the switches to each other.
(ii) For our application, I/O is not as demanding as the MPI communication, but we should still get - hopefully - acceptable I/O performance.
(iii) We can extend our storage by adding more Gluster nodes to the InfiniBand storage network when needed.
(iv) We can continue adding subclusters when we have the money, and we can also remove old ones when they "cost" too much (in terms of electricity/performance, maintenance, etc.).

Since we haven't worked with InfiniBand before, the question is simply: could there be issues with this approach?

Regards, and thanks,
/jon
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
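For readers unfamiliar with Gluster, the storage plan described in the post roughly corresponds to a GlusterFS setup along the following lines. This is a sketch only: the hostnames (gfs01, gfs02), brick paths, volume name, and replica count are made-up placeholders, not details from the original post, and exact CLI syntax varies between GlusterFS versions.

```shell
# Hypothetical sketch: two Gluster nodes (gfs01, gfs02) on the
# InfiniBand storage network, exporting one replicated volume over
# RDMA. All names and paths below are placeholders.

# On gfs01: add the second storage node to the trusted pool
gluster peer probe gfs02

# Create a 2-way replicated volume carried over the IB fabric (RDMA transport)
gluster volume create scratch replica 2 transport rdma \
    gfs01:/bricks/scratch gfs02:/bricks/scratch
gluster volume start scratch

# On each compute node: mount the volume, again over RDMA
mount -t glusterfs -o transport=rdma gfs01:/scratch /mnt/scratch
```

Adding a storage node later (point iii in the post) would then amount to probing the new peer and growing the volume with additional bricks.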