On Mar 21, 2010, at 8:05 AM, Alexander Motin wrote:

> Ivan Voras wrote:
>> Julian Elischer wrote:
>>> You can get better throughput by using the TSC for timing, because the
>>> GEOM and devstat code does a bit of timing. GEOM can be told to turn off
>>> its timing, but devstat can't. The 170 ktps figure is with the TSC as the
>>> timer and GEOM timing turned off.
>> 
>> I see. I just ran randomio on a gzero device, and with 10 userland
>> threads (this is a slow 2x quad-core machine) I quickly get g_up and
>> g_down saturated at ~120 ktps. Randomio uses gettimeofday() for its
>> measurements.
> 
> I've just got 140 ktps from two real Intel X25-M SSDs on an ICH10R AHCI
> controller and a single Core 2 Quad CPU. So at least in synthetic tests
> this level is reachable even with commodity hardware, though it completely
> saturated the quad-core CPU.
> 
>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>> barring specific class behaviour, it has a fair chance of working out of
>> the box), but the incoming queue will also need to be broken up for
>> greater effect.
> 
> According to the "notes", it looks like there is a good chance of
> introducing races, as some places expect only one up and one down thread.
> 

I agree that more threads just create many more race complications.  Even if 
they didn't, the storage driver is a serialization point; it doesn't matter if 
you have a dozen g_* threads if only one of them can be in the top half of the 
driver at a time.  No amount of fine-grained locking is going to help this.

I'd like to go in the opposite direction.  The queue-dispatch-queue model of 
GEOM is elegant and easy to extend, but it's very wasteful for the common 
simple case, where the stack consists of one or two partition transforms (MBR, 
bsdlabel) and/or a simple stripe/mirror transform.  None of these need a 
dedicated dispatch context in order to operate.  What I'd like to explore is 
compiling the GEOM stack at creation time into a linear array of operations 
that happen without a g_down/g_up context switch.  As providers and consumers 
taste each other and build a stack, that stack gets compiled into a graph, and 
that graph gets executed directly from the calling context, both from the 
dev_strategy() side on the top and the bio_done() side on the bottom.  GEOM 
classes that need a detached context can mark themselves as such; doing so 
will prevent a graph from being compiled, and the current dispatch model will 
be retained for them.
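
To make that concrete, here is a rough sketch of what a compiled graph might 
look like.  All of the g_compiled_* names and G_COMPILED_MAXDEPTH are 
hypothetical, not existing GEOM API; the sketch is only written against the 
real struct bio, struct g_consumer, and biofinish() from sys/bio.h and 
geom/geom.h:

    /*
     * Each node applies one layer's transform (e.g. adding a slice's
     * start offset) to the bio synchronously, in the caller's context.
     */
    struct g_compiled_node {
            int     (*gcn_transform)(struct g_compiled_node *, struct bio *);
            void    *gcn_softc;     /* per-layer state, e.g. a slice table */
    };

    struct g_compiled_graph {
            struct g_compiled_node  gcg_nodes[G_COMPILED_MAXDEPTH];
            int                     gcg_depth;
            struct g_consumer       *gcg_bottom; /* attached to the disk */
    };

    /* Called straight from dev_strategy(), with no g_down handoff. */
    static void
    g_compiled_start(struct g_compiled_graph *gcg, struct bio *bp)
    {
            int error, i;

            for (i = 0; i < gcg->gcg_depth; i++) {
                    error = gcg->gcg_nodes[i].gcn_transform(
                        &gcg->gcg_nodes[i], bp);
                    if (error != 0) {
                            biofinish(bp, NULL, error);
                            return;
                    }
            }
            /* The bio is fully translated; hand it to the disk driver. */
            gcg->gcg_bottom->provider->geom->start(bp);
    }

A bsdlabel node's transform would be little more than a bounds check plus 
"bp->bio_offset += slice_start", which is exactly the kind of work that 
doesn't justify two context switches per request.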

I expect that this will reduce I/O latency by a large margin, directly 
addressing the performance problem that FusionIO makes an example of.  I'd 
also like to explore having the g_bio model not require a malloc at every 
stage of the stack/graph; even though going through UMA is fairly fast, it 
still represents overhead that can be eliminated.  It also represents an 
out-of-memory failure case that can be prevented.
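
Since a compiled graph transforms a single bio in place instead of cloning it 
at each layer, the per-request allocations could collapse to one wrapper (or 
none, with preallocated per-device storage).  A minimal sketch, again with 
all g_compiled_* names hypothetical:

    /*
     * One allocation for the whole stack, replacing a g_clone_bio()
     * (and its UMA trip) per layer.  The transforms mutate the bio on
     * the way down; completion undoes them and runs directly in the
     * driver's completion context, with no g_up handoff.
     */
    struct g_compiled_bio {
            struct bio      gcb_bio;         /* mutated in place */
            off_t           gcb_orig_offset; /* saved before transforms */
            void            (*gcb_orig_done)(struct bio *);
    };

    static void
    g_compiled_done(struct bio *bp)
    {
            struct g_compiled_bio *gcb;

            gcb = __containerof(bp, struct g_compiled_bio, gcb_bio);
            /* Undo the offset translation so the caller sees the bio
             * exactly as submitted, then complete it directly. */
            bp->bio_offset = gcb->gcb_orig_offset;
            bp->bio_done = gcb->gcb_orig_done;
            biodone(bp);
    }

Drawing the wrapper from a preallocated per-device pool would also remove the 
out-of-memory failure path entirely.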

I might try to work on this over the summer.  It's really a research project in 
my head at this point, but I'm hopeful that it'll show results.

Scott
