Ekechi Nwokah wrote:
Reposting with (hopefully) more readable formatting.

[...]

Of course there are a zillion things you didn't mention. How many drives did you want to use? What kind? (SAS? SATA?) If you want 16 drives, you often get hardware RAID anyway, even if you don't use it. What config did you want? RAID-0? 1? 5? 6? Filesystem?


So let's say it's 16. But in theory it could be as high as 192. We'd use
multiple JBOD cards that present the drives individually (as separate
LUNs, for lack of a better term), and use software RAID to do all the
things that a 3ware/Areca, etc. card would do, across the total span of
drives.

Hmmm... Anyone with a large disk count SW RAID want to run a few bonnie++ like loads on it and look at the interrupt/CSW rates? Last I looked at a two-disk RAID0, we were seeing very high interrupt/CSW rates. This would quickly swamp any perceived advantage of "infinitely many" or "infinitely fast" cores. Sort of like Amdahl's law: make the expensive parallel computing portion take zero time, and you are still stuck with the serial time (which you can't do much about). Worse, it is size extensive, so as you increase the number of disks, you increase the interrupt rate (one controller per drive currently), and the base SATA drivers seem to have a problem with lots of CSW.
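If anyone wants to reproduce this, here is a minimal sketch of the measurement I mean -- nothing fancy, just sampling the system-wide "intr" and "ctxt" counters from /proc/stat once a second while bonnie++ runs in another window (vmstat 1 reports the same counters if you'd rather not compile anything):

/* irqrate.c -- minimal sketch: print system-wide interrupt and
 * context-switch rates per second, sampled from /proc/stat, while an
 * I/O load (bonnie++, IOzone, ...) runs in another window.
 * Linux-only; the "intr" and "ctxt" fields are documented in proc(5).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void read_counters(unsigned long long *intr, unsigned long long *ctxt)
{
    char line[8192];
    FILE *fp = fopen("/proc/stat", "r");
    if (!fp) { perror("/proc/stat"); exit(1); }
    while (fgets(line, sizeof(line), fp)) {
        sscanf(line, "intr %llu", intr);   /* total interrupts serviced */
        sscanf(line, "ctxt %llu", ctxt);   /* total context switches    */
    }
    fclose(fp);
}

int main(void)
{
    unsigned long long i0 = 0, c0 = 0, i1 = 0, c1 = 0;
    for (;;) {
        read_counters(&i0, &c0);
        sleep(1);
        read_counters(&i1, &c1);
        printf("interrupts/s: %llu   context switches/s: %llu\n",
               i1 - i0, c1 - c0);
    }
    return 0;
}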


RAID 0/1/5/6, etc., hotswap, SAS/SATA capability, etc.

Oh, and how do you measure performance? Bandwidth? Seeks? Transactions?
Transaction size? Mostly read? Write?



All of the above. We would want max per-drive performance, say 70 MB/s
reads with 100 IOPS on SATA, 120 MB/s reads with 300 IOPS on SAS, using 4k
transaction sizes. Hopefully we'd eliminate any queueing bottlenecks on the
hardware RAID card.

This (queuing bottleneck) hasn't really been an issue in most of the workloads we have seen. Has anyone seen this as an issue on their workloads?
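For scale, a rough back-of-the-envelope on those per-drive targets (assuming the 16-SATA-drive case mentioned above): streaming reads would aggregate to about 16 x 70 MB/s ~= 1.1 GB/s, while fully random 4k reads at 100 IOPS per drive come to only about 16 x 100 x 4 kB ~= 6.4 MB/s. Those two regimes stress very different parts of the stack, so "max per-drive performance" means very different things depending on which one you care about.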

Assume that we are using RDMA as the network transfer protocol so there
are no network interrupts on the cpus being used to do the XORs, etc.

er .... so your plan is to use something like a network with RDMA to attach the disks. So you are not using SATA controllers. You are using network controllers. With some sort of offload capability (RDMA without it is a little slow).

How does this save money/time again? You are replacing "expensive" RAID controllers with "expensive" network controllers (unless you forgo offload, in which case RDMA doesn't make much sense)?

Which network were you planning on using for the disks? Gigabit? 10 GbE? IB?

You sort of have something like this today in Coraid's AoE units. If you don't have experience with them, you should ask about what happens to the user load under intensive IO operations. Note: there is nothing wrong with Coraid units, we like them (and in full disclosure, we do resell them, and happily connect them to our JackRabbit units).

Right now, all the hardware cards start to precipitously drop in
performance under concurrent access, particularly read/write mixes.

Hmmm.... Are there particular workloads you are looking at? Huge reads with a tiny write? Most of the RAID systems we have seen suffer on small block random I/O. There your RAID system will get in the way (all the extra seeks and parity computations will slow you down relative to single disks); that is where you want RAID10s.
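For reference, the small-write penalty arithmetic behind that: a 4k random write on RAID5 becomes read old data + read old parity + write new data + write new parity, roughly 4 disk operations plus the XOR, and RAID6 adds another read/write pair for the second syndrome. On a RAID10 the same write is simply two mirrored writes, which is why small-block random workloads favor it.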

We have put our units (as well as software RAIDs) through some pretty hard tests: a single RAID card feeding 4 simultaneous IOzone and bonnie++ tests (each test file 2x the RAM in the server box) over channel-bonded quad gigabit. Apart from uncovering some kernel OOPSes due to the channel bond driver not liking really heavy loads, we sustained 360-390 MB/s out of the box, with large numbers of concurrent reads and writes. We simply did not see degradation. Could you cite some materials I can go look at, or help me understand which workloads you are talking about?

Areca is the best of the bunch, but that's not saying much compared to
Tier 1 storage ASICs/FPGAs.

You get what you pay for.


The idea here is twofold. Eliminate the cost of the hardware RAID and

I think you are going to wind up paying more than that cost in other elements, such as networking and JBOD cards (good ones, not the ones with crappy drivers).

handle concurrent accesses better. My theory is that 8 cores
would handle concurrent ARRAY access much better than the chipsets on
the hardware cards, and that if you did the parity calculations, CRC,
etc. using the SSE instruction set you could achieve a high level of
parallelism and performance.

The parity calculations are fairly simple, and last I checked, the md driver *DOES* benchmark which method computes the parity fastest at startup, in the md assemble stage. In fact, you can see SSE2, MMX, and Altivec implementations of RAID-6 in the Linux kernel source. Specifically, look at raid6sse2.c:

/*
 * raid6sse2.c
 *
 * SSE-2 implementation of RAID-6 syndrome functions
 *
 */

You can see the standard calc, the unrolled by 2 calc, etc.
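For the curious, here is a simplified, scalar sketch of what those routines compute (not the kernel code itself -- the SSE2 version does the same thing 16 bytes at a time in xmm registers): P is the plain XOR parity, and Q is a Horner-style accumulation where the running value is multiplied by 2 in GF(2^8) before each data disk is folded in.

/* Scalar sketch of RAID-6 P/Q syndrome generation -- a simplified
 * byte-at-a-time version of the algorithm the kernel's SSE2/MMX/Altivec
 * routines vectorize.  For illustration only, not the kernel code.
 */
#include <stddef.h>
#include <stdint.h>

/* Multiply a GF(2^8) element by x (i.e. by 2), using the RAID-6
 * generator polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static inline uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* dptr[0..ndisks-1] are the data blocks, each 'bytes' long.
 * p and q receive the two syndromes. */
void raid6_gen_syndrome(int ndisks, size_t bytes,
                        const uint8_t **dptr, uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < bytes; i++) {
        /* Start from the highest-numbered data disk... */
        uint8_t wp = dptr[ndisks - 1][i];
        uint8_t wq = wp;
        /* ...and fold the rest in: P is a running XOR, Q is multiplied
         * by 2 in GF(2^8) before each new data byte is XORed in, so the
         * final Q is sum over d of g^d * D_d. */
        for (int z = ndisks - 2; z >= 0; z--) {
            wp ^= dptr[z][i];
            wq = gf_mul2(wq) ^ dptr[z][i];
        }
        p[i] = wp;
        q[i] = wq;
    }
}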

If this is limited by anything (just eyeballing it), it would be a) a lack of functional units, b) SSE2 issue rate, c) SSE2 operand width.

Lack of functional units can sort of be handled by more cores. However, this code is essentially assembly written in C. Parallel assembly programming is not fun.

Moreover, OS jitter, context switching away from these calculations will be *expensive* as you have to restore not just the full normal register stack and frame, but all of the SSE2 registers. You would want to be able to dedicate entire cores to this, and isolate interrupt handling to other cores.
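A minimal sketch of the core-dedication half of that (assuming Linux; sched_setaffinity() and the /proc path below are the standard mechanisms, but treat this as illustrative, not a recipe): pin the parity thread to one core, and steer the disk/network IRQs to other cores via /proc/irq/<n>/smp_affinity.

/* Sketch: pin the parity-calculation thread to one core so it is not
 * context-switched (and its SSE2 state saved/restored) by work and
 * interrupt handling on other CPUs.  Linux-specific; the IRQs
 * themselves would be steered away via /proc/irq/<n>/smp_affinity
 * (not shown here).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 == the calling thread */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    if (pin_to_cpu(7) != 0) {       /* e.g. dedicate core 7 to XOR work */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the RAID parity loop here ... */
    return 0;
}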

I just haven't seen something like that, and I was not aware that md
could achieve anything close to the performance of a hardware RAID card
across a reasonable number of drives (12+), let alone provide the
feature set.

Due to SATA driver CSW/interrupt handling, I would be quite surprised if it were able to do this (achieve similar performance). I would bet performance would top out below 8 drives; my own experience suggests 4. After that, you have to start spending money on those SATA controllers, and you will still be plagued by interrupts/CSW, which will limit your performance. Your costs will start approaching those of the "expensive" RAID cards.

What we have found is that, generally, performance on SATA is very much a function of the quality of the driver, the implementation details of the controller, and how it handles heavy IO (does it swamp the motherboard with interrupts?). I have a SuperMicro 8-core deskside unit with a small RAID0 on 3 drives. When I try to push the RAID0 hard, I swamp the motherboard with huge numbers of interrupts/CSW. Note that this is not even doing RAID calculations, simply IO.

You are rate limited by how fast the underlying system can handle IO. The real value of any offload processor is that it, not so oddly enough, offloads stuff (calculations, interrupts, IO, ...) from the main CPUs. Some of the RAID cards for these units do a pretty good job of offloading, some are crap (and even with its issues, SW RAID is faster than the crappy ones).




-- Ekechi


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
