Ekechi Nwokah wrote:
Reposting with (hopefully) more readable formatting.

[...]

Of course there are a zillion things you didn't mention. How many drives did you want to use? What kind? (SAS? SATA?) If you want 16 drives, you often get hardware RAID anyway, even if you don't use it. What config did you want? RAID-0? 1? 5? 6? Filesystem?


So let's say it's 16. But in theory it could be as high as 192. We'd use
multiple JBOD cards that present the drives individually (as separate
LUNs, for lack of a better term), and use software RAID to do all the
things that a 3ware/Areca, etc. card would do, across the total span of
drives.

Hmmm... Anyone with a large disk count SW RAID want to run a few bonnie++ like loads on it and look at the interrupt/CSW rates? Last I looked at a two-disk RAID0, we were seeing very high interrupt/CSW rates. This would quickly swamp any perceived advantage of "infinitely many" or "infinitely fast" cores. Sort of like Amdahl's law: make the expensive parallel computing portion take zero time, and you are still stuck with the serial time (which you can't do much about). Worse, it is size extensive, so as you increase the number of disks, you increase the interrupt rate (one controller per drive currently), and the base SATA drivers seem to have a problem with lots of CSW.
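If anyone wants to reproduce this, here is a minimal sketch of the measurement I mean -- nothing fancy, just sampling the system-wide "intr" and "ctxt" counters from /proc/stat once a second while bonnie++ runs in another window (vmstat 1 reports the same counters if you'd rather not compile anything):

/* irqrate.c -- minimal sketch: print system-wide interrupt and
 * context-switch rates per second, sampled from /proc/stat, while an
 * I/O load (bonnie++, IOzone, ...) runs in another window.
 * Linux-only; the "intr" and "ctxt" fields are documented in proc(5).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void read_counters(unsigned long long *intr, unsigned long long *ctxt)
{
    char line[8192];
    FILE *fp = fopen("/proc/stat", "r");
    if (!fp) { perror("/proc/stat"); exit(1); }
    while (fgets(line, sizeof(line), fp)) {
        sscanf(line, "intr %llu", intr);   /* total interrupts serviced */
        sscanf(line, "ctxt %llu", ctxt);   /* total context switches    */
    }
    fclose(fp);
}

int main(void)
{
    unsigned long long i0 = 0, c0 = 0, i1 = 0, c1 = 0;
    for (;;) {
        read_counters(&i0, &c0);
        sleep(1);
        read_counters(&i1, &c1);
        printf("interrupts/s: %llu   context switches/s: %llu\n",
               i1 - i0, c1 - c0);
    }
    return 0;
}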


RAID 0/1/5/6, etc., hotswap, SAS/SATA capability, etc.

Oh, and how do you measure performance? Bandwidth? Seeks? Transactions?
Transaction size? Mostly read? Write?



All of the above. We would want max per-drive performance, say 70 MB/s
reads with 100 IOPS on SATA, 120 MB/s reads with 300 IOPS on SAS, using 4k
transaction sizes. Hopefully we'd eliminate any queueing bottlenecks on the
hardware RAID card.

This (queuing bottleneck) hasn't really been an issue in most of the workloads we have seen. Has anyone seen this as an issue on their workloads?
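For scale, a rough back-of-the-envelope on those per-drive targets (assuming the 16-SATA-drive case mentioned above): streaming reads would aggregate to about 16 x 70 MB/s ~= 1.1 GB/s, while fully random 4k reads at 100 IOPS per drive come to only about 16 x 100 x 4 kB ~= 6.4 MB/s. Those two regimes stress very different parts of the stack, so "max per-drive performance" means very different things depending on which one you care about.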

Assume that we are using RDMA as the network transfer protocol so there
are no network interrupts on the cpus being used to do the XORs, etc.

er .... so your plan is to use something like a network with RDMA to attach the disks. So you are not using SATA controllers. You are using network controllers. With some sort of offload capability (RDMA without it is a little slow).

How does this save money/time again? You are replacing "expensive" RAID controllers with "expensive" network controllers (unless you forgo offload, in which case RDMA doesn't make much sense)?

Which network were you planning on using for the disks? Gigabit? 10 GbE? IB?

You sort of have something like this today in Coraid's AoE units. If you don't have experience with them, you should ask about what happens to the user load under intensive IO operations. Note: there is nothing wrong with Coraid units, we like them (and in full disclosure, we do resell them, and happily connect them to our JackRabbit units).

Right now, all the hardware cards start to precipitously drop in
performance under concurrent access, particularly read/write mixes.

Hmmm.... Are there particular workloads you are looking at? Huge reads with a tiny write? Most of the RAID systems we have seen suffer on small block random I/O. There your RAID system will get in the way (all the extra seeks and parity computations will slow you down relative to single disks); that is where you want RAID10s.
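For reference, the small-write penalty arithmetic behind that: a 4k random write on RAID5 becomes read old data + read old parity + write new data + write new parity, roughly 4 disk operations plus the XOR, and RAID6 adds another read/write pair for the second syndrome. On a RAID10 the same write is simply two mirrored writes, which is why small-block random workloads favor it.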

We have put our units (as well as software RAIDs) through some pretty hard tests: a single RAID card feeding 4 simultaneous IOzone and bonnie++ tests (each test file 2x the RAM in the server box) over channel-bonded quad gigabit. Apart from uncovering some kernel OOPSes due to the channel bond driver not liking really heavy loads, we sustained 360-390 MB/s out of the box, with large numbers of concurrent reads and writes. We simply did not see degradation. Could you cite some materials I can go look at, or help me understand which workloads you are talking about?

Areca is the best of the bunch, but that's not saying much compared to
Tier 1 storage ASICs/FPGAs.

You get what you pay for.


The idea here is twofold. Eliminate the cost of the hardware RAID and

I think you are going to wind up paying more than that cost in other elements, such as networking and JBOD cards (good ones, not the ones with crappy drivers).

handle concurrent accesses better. My theory is that 8 cores
would handle concurrent ARRAY access much better than the chipsets on
the hardware cards, and that if you did the parity calculations, CRC,
etc. using the SSE instruction set you could achieve a high level of
parallelism and performance.

The parity calculations are fairly simple, and last I checked, the md driver *DOES* benchmark which method computes the parity fastest at startup, in the md assemble stage. In fact, you can see SSE2, MMX, and Altivec implementations of RAID-6 in the Linux kernel source. Specifically, look at raid6sse2.c:

/*
 * raid6sse2.c
 *
 * SSE-2 implementation of RAID-6 syndrome functions
 *
 */

You can see the standard calc, the unrolled by 2 calc, etc.
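For the curious, here is a simplified, scalar sketch of what those routines compute (not the kernel code itself -- the SSE2 version does the same thing 16 bytes at a time in xmm registers): P is the plain XOR parity, and Q is a Horner-style accumulation where the running value is multiplied by 2 in GF(2^8) before each data disk is folded in.

/* Scalar sketch of RAID-6 P/Q syndrome generation -- a simplified
 * byte-at-a-time version of the algorithm the kernel's SSE2/MMX/Altivec
 * routines vectorize.  For illustration only, not the kernel code.
 */
#include <stddef.h>
#include <stdint.h>

/* Multiply a GF(2^8) element by x (i.e. by 2), using the RAID-6
 * generator polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static inline uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* dptr[0..ndisks-1] are the data blocks, each 'bytes' long.
 * p and q receive the two syndromes. */
void raid6_gen_syndrome(int ndisks, size_t bytes,
                        const uint8_t **dptr, uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < bytes; i++) {
        /* Start from the highest-numbered data disk... */
        uint8_t wp = dptr[ndisks - 1][i];
        uint8_t wq = wp;
        /* ...and fold the rest in: P is a running XOR, Q is multiplied
         * by 2 in GF(2^8) before each new data byte is XORed in, so the
         * final Q is sum over d of g^d * D_d. */
        for (int z = ndisks - 2; z >= 0; z--) {
            wp ^= dptr[z][i];
            wq = gf_mul2(wq) ^ dptr[z][i];
        }
        p[i] = wp;
        q[i] = wq;
    }
}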

If this is limited by anything (just eyeballing it), it would be a) a lack of functional units, b) SSE2 issue rate, c) SSE2 operand width.

Lack of functional units can sort of be handled by more cores. However, this code is essentially assembly written in C. Parallel assembly programming is not fun.

Moreover, OS jitter, context switching away from these calculations will be *expensive* as you have to restore not just the full normal register stack and frame, but all of the SSE2 registers. You would want to be able to dedicate entire cores to this, and isolate interrupt handling to other cores.
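A minimal sketch of the core-dedication half of that (assuming Linux; sched_setaffinity() and the /proc path below are the standard mechanisms, but treat this as illustrative, not a recipe): pin the parity thread to one core, and steer the disk/network IRQs to other cores via /proc/irq/<n>/smp_affinity.

/* Sketch: pin the parity-calculation thread to one core so it is not
 * context-switched (and its SSE2 state saved/restored) by work and
 * interrupt handling on other CPUs.  Linux-specific; the IRQs
 * themselves would be steered away via /proc/irq/<n>/smp_affinity
 * (not shown here).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 == the calling thread */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    if (pin_to_cpu(7) != 0) {       /* e.g. dedicate core 7 to XOR work */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the RAID parity loop here ... */
    return 0;
}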

I just haven't seen something like that, and I was not aware that md
could achieve anything close to the performance of a hardware RAID card
across a reasonable number of drives (12+), let alone provide the
feature set.

Due to SATA driver CSW/interrupt handling, I would be quite surprised if it were able to do this (achieve similar performance). I would bet performance would top out below 8 drives; my own experience suggests 4. After that, you have to start spending money on those SATA controllers, and you will still be plagued by interrupts/CSW, which will limit your performance. Your costs will start approaching those of the "expensive" RAID cards.

What we have found is that, generally, performance on SATA is very much a function of the quality of the driver, the implementation details of the controller, and how it handles heavy IO (does it swamp the motherboard with interrupts?). I have a SuperMicro 8-core deskside unit with a small RAID0 on 3 drives. When I try to push the RAID0 hard, I swamp the motherboard with huge numbers of interrupts/CSW. Note that this is not even doing RAID calculations, simply IO.

You are rate limited by how fast the underlying system can handle IO. The real value of any offload processor is that it, not so oddly enough, offloads stuff (calculations, interrupts, IO, ...) from the main CPUs. Some of the RAID cards for these units do a pretty good job of offloading, some are crap (and even with its issues, SW RAID is faster than the crappy ones).




-- Ekechi


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
