Thanks Bill.  This is really helpful.

On Wed, 13 Dec 2006, Bill Broadley wrote:

> What do you expect the I/Os to look like?  Large file reads/writes?  Zillions
> of small reads/writes?  To one file or directory, or maybe to a file or
> directory per compute node?

We are basing our specs on large-file use. The cluster is used for many things, so I'm sure there will be some cases where small-file writes are done. Most of the work I do deals with large-file reads and writes, and that is what we are basing our desired performance on. I don't think we can afford to try to get this type of bandwidth for multiple small-file writes.

> My approach so far has been to buy N dual Opterons with 16 disks in each
> (using the Areca or 3ware controllers) and use NFS.  Higher-end 48-port
> switches come with 2-4 10G uplinks.  Numerous disk setups these days
> can sustain 800MB/sec (Dell MD-1000 external array, Areca 1261ML, and the
> 3ware 9650SE), all of which can be had in a 15/16-disk configuration for
> $8-$14k depending on the size of your 16 disks (400-500GB towards the lower
> end, 750GB towards the higher end).

Do you have a system like this in place right now?

> NFS would be easy, but any collection of clients (including all) would be
> performance limited by a single server.

This would be a problem, but...

> PVFS2 or Lustre would allow you to use N of the above file servers and
> get not too much less than N times the bandwidth (assuming large sequential
> reads and writes).

... this sounds hopeful. How manageable is this? Is it something that would take an FTE to keep going with 9 of these systems? I guess it depends on the systems themselves and how much fault tolerance there is.
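To check my own understanding of why N servers gets you close to N times the bandwidth for large files (and why small writes wouldn't benefit), here's a rough Python sketch of the round-robin striping idea. The stripe size and server count are made-up numbers, and this is obviously not actual PVFS2 or Lustre code:

# Rough illustration (not PVFS2/Lustre code) of why striping a large file
# across N I/O servers gives close to N times one server's bandwidth.
STRIPE_SIZE = 64 * 1024   # 64 KiB stripe unit (assumed, for illustration)
N_SERVERS = 9             # number of file servers in the hypothetical setup

def server_for_offset(offset):
    """Map a byte offset within a file to the server holding that stripe."""
    return (offset // STRIPE_SIZE) % N_SERVERS

def servers_touched(offset, length):
    """Which servers a contiguous read/write of `length` bytes would hit."""
    first = offset // STRIPE_SIZE
    last = (offset + length - 1) // STRIPE_SIZE
    return {i % N_SERVERS for i in range(first, last + 1)}

# A large sequential transfer spreads across all servers, so each server
# only has to deliver ~1/N of the total bandwidth:
print(sorted(servers_touched(0, 100 * 1024 * 1024)))   # all 9 servers
# A small write lands on a single server and sees no aggregation benefit:
print(sorted(servers_touched(0, 4 * 1024)))             # just server 0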


> In particular the Dell MD-1000 is interesting in that it allows for two 12Gbit
> connections (via SAS); the docs I've found show you can access all 15
> disks via a single connection, or 7 disks on one and 8 disks on the other.
> I've yet to find out if you can access all 15 disks via both interfaces
> to allow failover in case one of your fileservers dies.  As previously
> mentioned, both PVFS2 and Lustre can be configured to handle this situation.

> So you could buy a pair of dual Opterons + SAS card (with 2 external
> connections), then connect each port to each array (both servers to
> both connections); then if a single server fails, the other can take
> over the failed server's disks.

> A recent quote showed that a config like this (2 servers, 2 arrays) would
> cost around $24k.  Assuming one spare disk per chassis and a 12+2 RAID6 array,
> that would provide 12TB usable (not including 5% for filesystem overhead).

Are 1 TB drives out now? With 750 GB drives wouldn't it be 9 TB per array? We have a 13+2 RAID6 + hot spare array with 750 GB drives, and with an XFS file system we get 8.9 TiB.
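For what it's worth, here is the quick arithmetic behind that 8.9 TiB figure, just converting vendor TB to the TiB the filesystem reports (Python only for the arithmetic):

TB = 1e12      # vendor "terabyte" (decimal)
TiB = 2**40    # tebibyte, what df/XFS report (binary)

data_disks = 13                       # 13+2 RAID6 plus a hot spare
usable_bytes = data_disks * 750e9     # 750 GB drives
print(usable_bytes / TB)              # 9.75 TB of raw data capacity
print(usable_bytes / TiB)             # ~8.87 TiB, matching the ~8.9 TiB we see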

> So 9 of the above = $216k and 108TB usable; each of the arrays Dell claims
> can manage 800MB/sec.  Things don't scale perfectly, but I wouldn't be surprised
> to see 3-4GB/sec using PVFS2 or Lustre.  Actual data points appreciated; we
> are interested in a 1.5-2.0GB/sec setup.

Based on the 8.9 TiB above for 16 drives, it looks like 8.2 TiB for 15 drives, so we'd want 12 of these to get about 98 TiB of usable storage. I don't know what the overhead in PVFS2 or Lustre is compared to XFS, but I doubt it would be any less, so we might even need 13.

So, 13 * $24K = $312K.  Ah, what's another $100K.
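Spelling that estimate out (same TB-to-TiB conversion as above; the 12+2 RAID6 + hot spare layout per 15-disk array is my assumption about how we'd configure them):

import math

TiB = 2**40
usable_per_array = 12 * 750e9 / TiB   # 12 data disks at 750 GB -> ~8.19 TiB
target = 98                           # TiB of usable storage we'd like

arrays = math.ceil(target / usable_per_array)   # 12 arrays
print(arrays, arrays * 24000)                   # 12 units, ~$288k
print((arrays + 1) * 24000)                     # 13 units, $312k, if PVFS2/
                                                # Lustre overhead eats more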

> Are any of the solutions you are considering cheaper than this?  Any of the
> dual Opterons in a 16-disk chassis could manage the same bandwidth (both 3ware
> and Areca claim 800MB/sec or so), but could not survive a file server death.

So far this is the best price for something that can theoretically give the desired performance. I say theoretically because I'm not sure which parts of this you have in place. I'm trying to find real-world implementations that provide in the ballpark of 5 to 10 MB/sec at each node when on the order of a hundred nodes are reading and writing at the same time.
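As a sanity check that this target is even in the right ballpark (the node count and per-node rates are just our assumptions; the 800 MB/sec per array is Dell's claim quoted above):

nodes = 100                       # order-of-magnitude node count
for per_node_mb in (5, 10):       # MB/sec per node we'd like to sustain
    aggregate = nodes * per_node_mb
    print(per_node_mb, "MB/s per node ->", aggregate / 1000.0, "GB/s aggregate")

# Roughly 0.5-1.0 GB/sec total, which is under the 1.5-2.0 GB/sec you mention
# and well under 9 arrays * 800 MB/sec theoretical, assuming PVFS2 or Lustre
# scales reasonably.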

Are you using PVFS2 or Lustre with your N Opteron servers? When you run a job with many nodes writing large files at the same time what kind of performance do you get per node? What is your value of N for the number of Opteron server/disk arrays you have implemented?

Thanks again for all of this information. I hadn't been thinking seriously about PVFS2 or Lustre because I'd been thinking more along the lines of individual disks in the nodes. Using RAID arrays would be much more manageable. Are there others who have this type of system implemented who can provide performance results, as well as a view on how manageable it is?

Thanks,

Steve
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
