http://axboe.livejournal.com/2258.html
May. 20th, 2009
01:43 pm - pdflush epitaph
It seems it has been about 2 months since I last posted here. That's not due to lack of kernel activity, though real life has interfered a bit, with the addition of one more son to the family.
The patch set has been undergoing some changes since I last posted. One is the ability to have more than one thread per backing device. This is supposed to be useful for extreme cases where a single CPU cannot keep up with a very fast device. I have yet to actually test this part; I'm hoping some of the interested parties will join the fun and add the file system code that enables placement and flushing of dirty inodes on several writeback threads per bdi. Another change is lazy create/exit of flusher threads (sketched below). pdflush has 2-8 threads running depending on what mood it is in. With the per-bdi flusher threads, they will not get created unless they are going to be working, and if they have been idle for some time, they will exit again. So this should respond more smoothly to actual system demands; there's not much point in having 100 idle threads for 100 disks if only a fraction of those disks are actually busy with writeback at any given time.
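To illustrate the lazy create/exit idea, here is a minimal sketch of what such a flusher thread's main loop could look like. The helper names (bdi_has_dirty_io(), bdi_flush_dirty_inodes()) and the 5 minute idle timeout are made up for illustration; this is not code from the actual patch set:

static int bdi_flusher_thread(void *data)
{
        struct backing_dev_info *bdi = data;
        unsigned long last_active = jiffies;

        while (!kthread_should_stop()) {
                if (bdi_has_dirty_io(bdi)) {            /* made-up helper */
                        bdi_flush_dirty_inodes(bdi);    /* made-up helper */
                        last_active = jiffies;
                } else if (time_after(jiffies, last_active + 5 * 60 * HZ)) {
                        /* nothing to do for 5 minutes: exit, a new
                         * thread gets created if work shows up again */
                        break;
                }
                /* sleep a bit between passes */
                schedule_timeout_interruptible(msecs_to_jiffies(5000));
        }
        return 0;
}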
I've also done a bit of testing this week, and the results look pretty good. Most runs show the new approach reaching similar performance at lower system utilization, or higher performance outright. So that's all good. Yanmin Zhang (Intel) ran into a bug (that may or may not already be fixed; I'll know tomorrow when tests are run with new patches) and posted a fio job file that he reproduced it with. I decided to run the test with and without the writeback patches to compare results. The disk used is a 32GB Intel X25-E SSD and the file system is ext4.
| Kernel    | Throughput | usr CPU | sys CPU | disk util |
| writeback | 175MB/sec  | 17.55%  | 43.04%  | 97.80%    |
| vanilla   | 147MB/sec  | 13.44%  | 47.33%  | 85.98%    |
Pretty decent result, I'd say. Apart from the lower system utilization, the interesting bit is how the writeback patches actually enable us to keep the disk busy. ~86% utilization for the vanilla kernel is pretty depressing. The fio job file used was:
[global]
direct=0
ioengine=mmap
iodepth=256
iodepth_batch=32
size=1500M
bs=4k
pre_read=1
overwrite=1
numjobs=1
loops=5
runtime=600
group_reporting
directory=/data
[job_group0_sub0]
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
startdelay=0
rw=randwrite
filename=f1:f2
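For what it's worth, that job is almost pure page cache dirtying: direct=0 plus ioengine=mmap means the writes are plain stores into mmap'ed file pages, pre_read=1 faults the files in up front so the timed phase isn't polluted by reads, and the exec_prerun line drops the page cache before the run (so it needs to be run as root). Nearly all of the actual disk traffic thus ends up being driven by the writeback path under test.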
The next few days will be spent polishing the patch set and posting version 5. That one should hopefully be ready for inclusion in the -next tree, and then be headed upstream for 2.6.31.
Oh, and that Intel disk kicks some serious ass. For sequential writes, it maintains 210MB/sec easily. I have a few OCZ Vertex disks as well, which do pretty well for sequential writes too; for random writes, though, the Intel drive is in a different league. For my birthday, I want 4 more Intel disks for testing!
Comments:
From: diegocg
Date: May 20th, 2009 03:01 pm (UTC)
Are those numbers from a single disk? If per-bdi flushers are supposed to be a feature for systems with multiple devices, why does it improve performance on a single disk? Maybe because with the current system the multiple threads generate interleaved accesses? Or is it an effect of the reduced CPU usage? And what are the expected results on non-SSD disks: will it improve single-disk performance too, or only help on systems with multiple rotational disks?
Mar. 12th, 2009
So this week I began playing with implementing an alternative approach to buffered writeback. Right now we have the pdflush threads taking care of this, but that has a number of annoying points that made me want to try something else. One is that writeout tends to be very lumpy, which is easily visible in vmstat. Another is that it doesn't play well with the request allocation scheme, since pdflush backs off when a queue becomes congested (see the sketch below). And fairness between congested users and blocking users is... well, not there.
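For reference, the backoff in question sits in background_writeout() in mm/page-writeback.c and looks roughly like this (paraphrased from memory, details trimmed, not a verbatim quote): when a pass writes less than it asked for because a queue was congested, pdflush just sleeps and retries, and that on/off pattern is exactly the lumpiness you can see in vmstat.

        for (;;) {
                wbc.more_io = 0;
                wbc.encountered_congestion = 0;
                wbc.nr_to_write = MAX_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
                if (wbc.nr_to_write > 0) {
                        /* wrote less than requested */
                        if (wbc.encountered_congestion || wbc.more_io)
                                congestion_wait(WRITE, HZ / 10);
                        else
                                break;  /* nothing left to do */
                }
        }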
Enter bdi threads. The first step was moving the dirty inodes to some place where the bdi threads could easily get at them. So instead of putting them on the super_block lists, we put them on bdi lists of similar names. One upside of this change is that we no longer have to do a linear search for the bdi; we have it upfront. The next step is forking a thread per bdi that does IO. My initial approach simply created a kernel thread when the bdi was registered, but I'm sure lots of people would find that wasteful. So instead it now registers a forker thread on behalf of the default backing device (default_backing_dev_info), which takes care of creating the appropriate threads when someone calls bdi_start_writeback() on a bdi (sketched below). It'll handle memory pressure conditions as well; find the details in the patch set.
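To show the shape of it, here is a rough sketch of the forker idea. The names (bdi_pending_list, forker_wait, wb_task) are invented for illustration, and locking and error handling are left out. The idea is that bdi_start_writeback() would put a thread-less bdi on the pending list and wake forker_wait, with bdi_flusher_thread being a main loop along the lines of the one sketched in the post above:

static LIST_HEAD(bdi_pending_list);     /* bdis that need a flusher */
static DECLARE_WAIT_QUEUE_HEAD(forker_wait);

static int bdi_forker_thread(void *data)
{
        while (!kthread_should_stop()) {
                struct backing_dev_info *bdi, *tmp;

                wait_event_interruptible(forker_wait,
                                !list_empty(&bdi_pending_list));

                /* locking elided for brevity */
                list_for_each_entry_safe(bdi, tmp, &bdi_pending_list,
                                         bdi_list) {
                        list_del_init(&bdi->bdi_list);
                        bdi->wb_task = kthread_run(bdi_flusher_thread,
                                        bdi, "bdi-%s",
                                        dev_name(bdi->dev));
                }
        }
        return 0;
}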
Initial tests look pretty good, though I haven't done a whole lot of testing on this yet. It's still very fresh code. I posted it on lkml today; you can find the individual patches and the complete description there. As always, the patches are also in my git repo, in the writeback branch.