[Beowulf] Project Planning: Storage, Network, and Redundancy Considerations

Brian R. Smith Mon, 19 Mar 2007 07:40:06 -0800

Hey list,

I am seeking some advice regarding our latest project. Currently, our shop runs 5 different clusters of varying size and handles the maintenance and administration of each. I've been planning for some time to finally consolidate all of these machines together under a single head-node with a common storage pool (for /home, /opt, /usr/local), utilizing SGE for our resource management. A lot of times on this list, the point comes up that many things depend upon your applications, so I'll make it clear here: our "application" is quite varied. Our users come from a wide variety of disciplines, and our group acts as a sort of tier-2 scientific computing lab where we provide hardware, development environments, and support for developing and running applications of various natures, hence general-purpose.

We have fairly robust systems in place for node provisioning (an in-house system that utilizes kickstart and anaconda and supports multiple architectures) and resource management (SGE has proven extremely reliable and more than capable of managing our fairly quaint resources).

Currently, my two largest problems are figuring out our storage needs (in terms of device bandwidth and throughput) and our network needs. When all is said and done, this is the hardware I expect to have:

~60x 16GiB RAM, Dual-Dual-Core AMD Opterons, IB-connected, GigE-connected, with modest local storage
8x 16GiB RAM, Dual-Dual-Core AMD Opterons & 24x 8GiB RAM, Dual-Opterons, Myrinet-connected, GigE-connected, modest storage (cluster1)


We wish to add to this cluster the following existing configurations:
12x AMD Opteron 246, 4GiB RAM, Myrinet-connected, etc. (cluster2)
38x AMD Opteron 246, 4GiB RAM, GigE-connected (clusters 3 & 4)
~40x Intel P4 Xeon @2.66GHz, 2GiB RAM, GigE-connected (cluster5)

Yes, I know the last sets of machines are approaching (or already are) legacy status (especially the last batch), but these machines are still useful at running the problems they were originally purchased for (especially the Opterons), and are still very good at some other general tasks (Distributed Matlab, commercial FE codes, instructional use, etc).

Currently, each cluster has its own local storage, averaging about 1TB each. We've currently got about 4TB of total data across all of these machines but anticipate this number possibly doubling within the next 12-18 months. The first phase of this plan (which must occur in concert with the second) is to consolidate all of these disparate arrays into one volume that is accessible by every node in the cluster. I know that some of the supercomputing centers like NCSA have dealt with much larger-scale storage issues than this, so I'd love to hear from one of you. The current ideas that we have been floating around include the following:

1. Proprietary parallel storage systems (like Panasas, etc.): Such a system provides the per-node bandwidth, aggregate bandwidth, caching mechanisms, fault-tolerance, and redundancy that we require (plus having a vendor offering 24x7x365 support & 24-hour turnover is quite a breath of fresh air for us). The price point is a little high for the amount of storage that we will get, though: little more than double our current overall capacity. As far as I can tell, I can use this device as a permanent data store (like /home) and also as the user's scratch space so that there is only a single point for all data needs across the cluster. It does, however, require the installation of vendor kernel modules, which often add overhead to system administration (as they need to be compiled, linked, and tested before every kernel update).

2. Separate /home and /scratch volumes. /home would be NFS-exported read-only to all hosts (to prevent writes during run-time). The volume would reside on one or two file servers (Sun's Thumper/X4500, etc., on either JFS or GFS (or perhaps ZFS???), depending on hardware), and at current prices we would be able to acquire around 20TB. We would double this purchase and provide the same setup off-site for redundancy (including our tape-backup regime). Bandwidth for reads is more than sufficient for the needs of our current users. The scratch space would comprise 8-12 nodes with 0.5 TB RAID1 storage each, utilizing either PVFS2 (which has worked exceptionally well for us previously) or Lustre (which we have not tested very well yet). Both require separate kernel modules (this seems to be a recurring theme) and hence some additional administration. Neither is well-suited for general tasks such as compiling (though there are ways around this) or problems involving many short writes, but most of the applications being run do not fit this profile. 8-12 nodes should provide us between 3 and 6TB of usable scratch. We would like a little more, but again, this is sufficient for our current usage patterns. The pricing for this might be somewhat less than the proprietary system described above.
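
For anyone wanting to check the sizing in option 2, here is a rough back-of-the-envelope sketch in Python. The node counts, the 0.5 TB per node, the 4TB of current data, and the ~20TB /home figure come from the post; the function names and the fs_overhead parameter are hypothetical, and 0.5 TB is assumed to be what each node offers after RAID1 mirroring.

```python
# Back-of-the-envelope storage sizing using the figures quoted in the post.
# fs_overhead is a hypothetical knob, not a measured PVFS2/Lustre number.

def scratch_capacity_tb(nodes, per_node_tb=0.5, fs_overhead=0.0):
    """Usable scratch in TB for `nodes` servers of `per_node_tb` each."""
    if not 0.0 <= fs_overhead < 1.0:
        raise ValueError("fs_overhead must be in [0, 1)")
    return nodes * per_node_tb * (1.0 - fs_overhead)


def home_headroom(capacity_tb, current_tb, growth_factor):
    """How many times the projected data set fits in the /home volume."""
    return capacity_tb / (current_tb * growth_factor)


if __name__ == "__main__":
    for nodes in (8, 12):
        raw = scratch_capacity_tb(nodes)
        print(f"{nodes} nodes: {raw:.1f} TB before filesystem overhead")
    # 4 TB today, possibly doubling in 12-18 months, on a ~20 TB /home.
    print(f"/home headroom after doubling: {home_headroom(20, 4, 2):.1f}x")
```

With no overhead, 8 and 12 nodes give 4 and 6 TB. Setting fs_overhead=0.25 on 8 nodes gives 3 TB, which happens to match the low end quoted above, though the post does not say how that figure was derived.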


Can anyone suggest any other approaches to this problem?

We also have a problem regarding how to link these clusters together over a single network fabric (GigE). It will be possible for all nodes to utilize this network for message passing, but it is highly improbable that such a scenario will ever play out, since almost all of our MPI jobs will no doubt run on either the Infiniband or Myrinet nodes (there are SGE policies in place to help ensure this).

Currently, each cluster has its own GigE network for provisioning, administration, and resource management. Some of these hosts utilize it for communications (clusters 3, 4, & 5) and all of them will no doubt need to utilize it for filesystem access. Clusters 3 and 4 can be consolidated to a single GigE HP switch that will have a couple of ports left over. Cluster 5 will have to be kept as-is, and clusters 1 and 2 will fit on a single switch as well. I have discussed with our campus network admin the possibility of using two recent Cisco switches that would support failover and load balancing as a redundant and high-bandwidth "trunk" for each of these networks, obviously with the capacity to grow in the future. Each of our existing 3 switches would have up to two links to each "trunk" switch, and our file servers (in whichever configuration we eventually choose) would also be attached to these switches. There should be enough bandwidth to go around under this plan. I'm just curious if this seems doable and, if it is, are there any obvious pitfalls that I have overlooked? Is there perhaps a better way to approach this (perhaps a single, large switch instead)?
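
One way to sanity-check the "enough bandwidth to go around" claim is to work out the uplink oversubscription per edge switch. The sketch below is only an estimate under stated assumptions: every node and every uplink is plain 1 Gb/s GigE, all four uplinks (two to each of the two trunk switches) carry traffic at once, and the per-switch node counts are read off the hardware list above (38 for clusters 3 & 4, ~40 for cluster 5, 8 + 24 + 12 = 44 for clusters 1 & 2). The post does not say which switch the ~60 new nodes would land on, so they are left out. Function names are hypothetical.

```python
# Uplink oversubscription estimate for the proposed two-trunk GigE layout.
# Assumes 1 Gb/s per node port and per uplink; link speeds beyond "GigE"
# are not given in the post.

GIGE_GBPS = 1.0


def oversubscription(nodes, uplinks, node_gbps=GIGE_GBPS, uplink_gbps=GIGE_GBPS):
    """Ratio of total node-facing bandwidth to total uplink bandwidth."""
    if uplinks <= 0:
        raise ValueError("need at least one uplink")
    return (nodes * node_gbps) / (uplinks * uplink_gbps)


def per_node_share_mbps(nodes, uplinks, uplink_gbps=GIGE_GBPS):
    """Worst-case uplink share per node if all nodes do I/O at once."""
    return uplinks * uplink_gbps * 1000.0 / nodes


if __name__ == "__main__":
    switches = {"clusters 3 & 4": 38, "cluster 5": 40, "clusters 1 & 2": 44}
    for name, nodes in switches.items():
        ratio = oversubscription(nodes, uplinks=4)
        share = per_node_share_mbps(nodes, uplinks=4)
        print(f"{name}: {ratio:.1f}:1 oversubscribed, ~{share:.0f} Mb/s per node")
```

Under these assumptions each edge switch is roughly 10:1 oversubscribed toward the file servers, which may well be acceptable if only a fraction of nodes hit the filesystem at the same time.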

Our final problem is a relatively simple one, though I am definitely a newbie to the H.A. world. Under this consolidation plan, we will have only one point of entry to this cluster and hence a single point of failure. Have any beowulfers had experience with deploying clusters with redundant head nodes in a pseudo-H.A. fashion (heartbeat monitoring, fail-over, etc.), and what experiences have you had in adapting your resource manager to this task? Would it simply be more feasible to move the resource manager to another machine at this point (and have both headnodes act as submit and administrative clients)? My current plan is unfortunately light on the details of handling SGE in such an environment. It includes purchasing two identical 1U boxes (with good support contracts). They will monitor each other for availability and the goal is to have the spare take over if the master fails. While the spare is not in use, I was planning on dispatching jobs to it.

There are a number of unfilled blanks in this plan currently (and I have a month with which to fine-tune the rest of this), so if anyone would be kind enough to offer suggestions on how to fill in a few, I'd appreciate it.
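
To make the heartbeat-and-takeover idea concrete, here is a minimal Python sketch of the decision the spare would have to make. It is an illustration only, not how any particular H.A. package or SGE itself does it: the class name, the timeout value, and the role strings are all hypothetical, and a real setup would also need fencing so that both boxes never act as master at once.

```python
# Minimal sketch of timeout-based failover between two head nodes.
# All names and the 10-second timeout are illustrative assumptions.

class HeartbeatMonitor:
    """Tracks the last heartbeat seen from the peer head node."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_seen = now  # treat startup as an implicit heartbeat

    def record_heartbeat(self, now):
        """Call whenever a heartbeat message arrives from the peer."""
        self.last_seen = now

    def peer_alive(self, now):
        """Peer counts as alive until timeout_s passes with no heartbeat."""
        return (now - self.last_seen) <= self.timeout_s


def spare_role(monitor, now):
    """The spare stays on standby while the master is heard from."""
    return "standby" if monitor.peer_alive(now) else "master"


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=10.0, now=0.0)
    monitor.record_heartbeat(now=5.0)
    for t in (12.0, 16.0):
        print(f"t={t:.0f}s: spare is {spare_role(monitor, t)}")
```

Timestamps are passed in explicitly rather than read from the clock so the takeover logic can be exercised without waiting; in practice they would come from something like time.monotonic().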

Thanks to all in advance for any help!

Brian Smith


--
--------------------------------------------------------
+ Brian R. Smith                                       +
+ HPC Systems Analyst & Programmer                     +
+ Research Computing, University of South Florida      +
+ 4202 E. Fowler Ave. LIB618                           +
+ Office Phone: 1 (813) 974-1467                       +
+ Mobile Phone: 1 (813) 230-3441                       +
+ Organization URL: http://rc.usf.edu                  +
--------------------------------------------------------

