Thanks Bill. This is really helpful.
On Wed, 13 Dec 2006, Bill Broadley wrote:
> What do you expect the I/Os to look like? Large file reads/writes? Zillions of small reads/writes? To one file or directory, or maybe to a file or directory per compute node?
We are basing our specs on large file use. The cluster is used for many things, so I'm sure there will be some cases where small file writes are done. Most of the work I do deals with large file reads and writes, and that is what we are basing our desired performance on. I don't think we can afford to try to get this type of bandwidth for multiple small file writes.
> My approach so far has been to buy N dual Opterons with 16 disks in each (using the Areca or 3ware controllers) and use NFS. Higher-end 48-port switches come with 2-4 10G uplinks. Numerous disk setups these days can sustain 800MB/sec (Dell MD-1000 external array, Areca 1261ML, and the 3ware 9650SE), all of which can be had in a 15/16-disk configuration for $8-$14k depending on the size of your 16 disks (400-500GB towards the lower end, 750GB towards the higher end).
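Just to sanity-check the network side of that for myself, here's the rough arithmetic I'm using (the 70% usable-throughput figure is my own guess for protocol overhead, not a measured number):

    # Rough sanity check (my own numbers): can one 10GbE uplink carry a
    # full 800 MB/sec array?  The 70% usable-throughput figure is a guess
    # for TCP/NFS/Ethernet overhead, not a measured value.
    link_gbit_per_s = 10          # raw 10GbE line rate
    wire_efficiency = 0.70        # assumed usable fraction after protocol overhead
    array_mb_per_s = 800          # sustained rate quoted per array

    usable_mb_per_s = link_gbit_per_s * 1000 / 8 * wire_efficiency
    print(f"usable link bandwidth ~{usable_mb_per_s:.0f} MB/s "
          f"vs array at {array_mb_per_s} MB/s")
    # -> roughly 875 MB/s usable, so one 10G uplink per 800 MB/sec server
    #    is about a wash; the arrays and the links are fairly well matched.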
Do you have a system like this in place right now?
> NFS would be easy, but any collection of clients (including all) would be performance limited by a single server.
This would be a problem, but...
> PVFS2 or Lustre would allow you to use N of the above file servers and get not much less than N times the bandwidth (assuming large sequential reads and writes).
... this sounds hopeful. How manageable is this? Is it something that would take an FTE to keep running with 9 of these systems? I guess it depends on the systems themselves and how much fault tolerance there is.
> In particular the Dell MD-1000 is interesting in that it allows for two 12Gbit connections (via SAS); the docs I've found show you can access all 15 disks via a single connection, or 7 disks on one and 8 disks on the other. I've yet to find out if you can access all 15 disks via both interfaces to allow failover in case one of your fileservers dies. As previously mentioned, both PVFS2 and Lustre can be configured to handle this situation. So you could buy a pair of dual Opterons + a SAS card (with 2 external connections), then connect each port to each array (both servers to both connections); then if a single server fails, the other can take over the failed server's disks. A recent quote showed that a config like this (2 servers, 2 arrays) would cost around $24k. Assuming one spare disk per chassis and a 12+2 RAID6 array, that would provide 12TB usable (not including ~5% for filesystem overhead).
Are 1 TB drives out now? With 750 GB drives wouldn't it be 9 TB per array? We have a 13+2 RAID6 + hot spare array with 750 GB drives, and with an XFS filesystem we get 8.9 TiB.
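For what it's worth, the numbers line up once you account for decimal GB versus binary TiB; this is the arithmetic I'm working from:

    # Usable-capacity check for our 13+2 RAID6 + hot spare array of 750 GB
    # drives.  Drive sizes are decimal GB (10^9 bytes); the filesystem reports
    # binary TiB (2^40 bytes), which accounts for most of the apparent shrinkage.
    data_drives = 13                               # 13+2 RAID6: 13 drives of data
    drive_gb = 750                                 # vendor-rated decimal gigabytes

    raw_tib = data_drives * drive_gb * 10**9 / 2**40
    print(f"13 data drives: {raw_tib:.2f} TiB")    # ~8.87 TiB, close to the 8.9 we see
    print(f"12 data drives: {12 * drive_gb * 10**9 / 2**40:.2f} TiB")  # ~8.19 TiB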
> So 9 of the above = $216k and 108TB usable. Each of the arrays, Dell claims, can manage 800MB/sec; things don't scale perfectly, but I wouldn't be surprised to see 3-4GB/sec using PVFS2 or Lustre. Actual data points appreciated; we are interested in a 1.5-2.0GB/sec setup.
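As a rough framing of that scaling claim (the 50% efficiency factor below is purely my own guess, not anything measured):

    # Back-of-the-envelope aggregate bandwidth for N file servers under
    # PVFS2/Lustre.  The 50% scaling efficiency is purely my assumption
    # (striping, network and metadata losses), not a measured number.
    per_array_mb_per_s = 800       # claimed sustained rate per array
    scaling_efficiency = 0.5       # assumed fraction of perfect scaling

    for n_servers in (2, 9, 13):
        aggregate_gb_per_s = n_servers * per_array_mb_per_s * scaling_efficiency / 1000
        print(f"{n_servers:2d} servers -> ~{aggregate_gb_per_s:.1f} GB/sec aggregate")
    # -> about 3.6 GB/sec at 9 servers, in the range Bill mentions, but only
    #    real measurements would settle it.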
Based on the 8.9 TiB above for 16 drives, it looks like 8.2 TiB for 15 drives, so we'd want 12 of these to get about 98 TiB of usable storage. I don't know what the overhead is in PVFS2 or Lustre compared to XFS, but I doubt it would be any less, so we might even need 13.
So, 13 * $24K = $312K. Ah, what's another $100K.
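Putting my own rough numbers in one place (Bill's $24k quote, the 750 GB drive assumption, and our approximate capacity target):

    # My tally for the parallel-filesystem option: 12+2 RAID6 arrays of
    # 750 GB drives, at roughly $24k per building block (Bill's quoted
    # figure); the usable-capacity target is our own rough goal.
    import math

    usable_tib_per_array = 12 * 750 * 10**9 / 2**40   # ~8.19 TiB per array
    target_tib = 98                                   # rough usable-capacity goal
    cost_per_block = 24_000                           # quoted per 2-server/2-array block

    arrays = math.ceil(target_tib / usable_tib_per_array)
    print(f"{arrays} arrays -> {arrays * usable_tib_per_array:.0f} TiB usable, "
          f"~${arrays * cost_per_block:,}")
    # -> 12 arrays, ~98 TiB, ~$288k before any PVFS2/Lustre overhead;
    #    allowing for that overhead, 13 arrays pushes it to ~$312k.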
> Are any of the solutions you are considering cheaper than this? Any of the dual Opterons in a 16-disk chassis could manage the same bandwidth (both 3ware and Areca claim 800MB/sec or so), but could not survive a file server death.
So far this is the best price for something that can theoretically give the desired performance. I say "theoretically" because I'm not sure what parts of this you have in place. I'm trying to find real-world implementations that provide on the order of 5 to 10 MB/sec at the nodes when something like a hundred nodes are reading/writing at the same time.
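To spell out what that per-node target implies in aggregate (the node count and per-node rates are our targets, not measured figures):

    # What the per-node goal implies in aggregate, and how it compares to a
    # single 800 MB/sec server.  Node count and per-node rates are our own
    # targets; the 800 MB/sec is the per-array figure quoted above.
    nodes = 100
    single_server_mb_per_s = 800

    for per_node in (5, 10):
        aggregate = nodes * per_node
        print(f"{per_node} MB/sec x {nodes} nodes = {aggregate} MB/sec aggregate "
              f"({aggregate / single_server_mb_per_s:.2f}x one server)")
    # -> 500-1000 MB/sec aggregate; at the top of the range that already
    #    exceeds one server's sustained rate, which is why spreading the load
    #    across several servers looks necessary.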
Are you using PVFS2 or Lustre with your N Opteron servers? When you run a job with many nodes writing large files at the same time, what kind of performance do you get per node? What is your value of N for the number of Opteron server/disk arrays you have implemented?
Thanks again for all of this information. I hadn't been thinking seriously about PVFS2 or Lustre because I'd been thinking more along the lines of individual disks in nodes. Using RAID arrays would be much more manageable. Are there others who have this type of system implemented who can provide performance results, as well as a view on how manageable it is?
Thanks,
Steve