We are currently running GigE as our interconnect.

did you mention the kind of compute/client load you've got?

Basically we are currently running two NFS servers out to our web
servers.

uh, that sounds fine - web traffic tends to be quite read-cache
friendly, which NFS does very nicely.

We also are running three MySQL servers. The MySQL instances
are segmented right now, but we are about to start an eval of
Continuent's M//Cluster software.

have you measured the nature of your NFS and SQL loads?
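One rough way to get at that question on a Linux server is to sample /proc/diskstats twice and look at the read/write split and average request size per device. The sketch below is only an illustration: the function names and the device name "sda" are made up for the example, and it assumes the standard diskstats field layout with 512-byte sectors.

```python
import time

def parse_diskstats(text):
    """Map device name -> (reads completed, sectors read,
    writes completed, sectors written) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed lines
        stats[f[2]] = (int(f[3]), int(f[5]), int(f[7]), int(f[9]))
    return stats

def io_profile(before, after, device):
    """Summarise the I/O mix of one device between two samples."""
    reads, rsect, writes, wsect = (
        a - b for a, b in zip(after[device], before[device])
    )
    total = reads + writes
    return {
        "read_fraction": reads / total if total else 0.0,
        # diskstats counts 512-byte sectors regardless of the device
        "avg_read_kib": rsect * 512 / 1024 / reads if reads else 0.0,
        "avg_write_kib": wsect * 512 / 1024 / writes if writes else 0.0,
    }

def sample_device(device="sda", interval=10.0):
    """Hypothetical helper: profile `device` over `interval` seconds."""
    with open("/proc/diskstats") as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval)
    with open("/proc/diskstats") as fh:
        after = parse_diskstats(fh.read())
    return io_profile(before, after, device)
```

A high read fraction with small average requests points at seek-bound random reads; a large share of writes on the NFS servers would point more toward the sync-write question.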

As stated, our FS infrastructure leaves much to be desired.  The
current setup of NFS servers (Dell PE 2850s, each with 1TB of local
storage on 10K SCSI disks) has not performed well.  We are constantly IO
waiting.

but _why_? heavy write load without async NFS (and writeback at the
block level)? with multiple local 10K scsi disks, you really shouldn't
be seek limited, especially if requests are coming over just gigabit.
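To put a number on "constantly IO waiting", the aggregate cpu line in /proc/stat can be sampled twice and the iowait share computed from the difference. This is a minimal sketch, assuming a Linux kernel recent enough to report the iowait and steal fields; the function names are invented for the example.

```python
def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")

def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two samples.

    Only the first eight fields are summed (user, nice, system, idle,
    iowait, irq, softirq, steal); guest time is already counted in user.
    """
    delta = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(delta)
    return delta[4] / total if total else 0.0
```

Feeding it two reads of /proc/stat taken a few seconds apart gives the iowait share for that window, which can then be compared against the disk-level numbers to see whether the disks or something upstream is the limit.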

Another interesting thing is that each MySQL server is using an iSCSI
block device from SATA II NAS servers that we built using generic
Supermicro boards and Areca controllers.  Each of these boxes has approx
2.1TB of usable disk, and the performance has been surprisingly good.
The Areca 1160 controllers with 1GB cache are handling the load,
especially compared to our FS infrastructure of localized disks (I would
have thought the opposite would be true),

to me that indicates your disk-local servers are misconfigured.
(which reminds me - dell has shipped some _astoundingly_ bad raid systems
marketed as high-end...)

as the MySQL disk IO pattern
would be mostly small random IO, and the FS is mostly read (serving up
web pages).

but web pages will normally be nicely read-cached on the web frontends...

We have made pretty much every last ounce of optimization we can on
the NFS side (TCP, packet sizes, hugemem kernels, tried David Howells'
fscache on the web client side) but none has been the silver bullet
we've been looking for, which led us down the parallel fs path.

how much memory do the web servers have?  if the bottleneck IO really
is mostly-read pages, then local dram will help a lot.
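A quick way to answer that on each web server is to read /proc/meminfo and see how much RAM there is and how much of it the kernel is already using as page cache. The sketch below assumes a Linux /proc/meminfo with the usual MemTotal and Cached lines in kB; the function names are illustrative only.

```python
def parse_meminfo(text):
    """Map field name -> integer value (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info

def cache_summary(info):
    """Total RAM and page cache in GiB, plus the cached share of RAM."""
    total_kb = info["MemTotal"]
    cached_kb = info.get("Cached", 0)
    return {
        "total_gib": total_kb / 1024**2,
        "cached_gib": cached_kb / 1024**2,
        "cached_fraction": cached_kb / total_kb,
    }
```

If the set of hot pages is larger than what the web servers can hold in cache, reads will keep falling through to NFS no matter how the servers are tuned.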

on Cluster FS that we seek to employ.. Yet in an effort to scale to
the sky, we are going to try to do this correctly, rather than
continually being reactive.

not to insult, but I find that the main problem is usually an
insufficient understanding of the workload, not lapses in proactivity...
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
