Flushing out pdflush
[LWN.net: http://lwn.net/Articles/326552/]
The kernel page cache contains in-memory copies of data blocks belonging to files kept in persistent storage. Pages which have been written to by a process, but not yet written back to disk, accumulate in the cache and are known as "dirty" pages. The amount of dirty memory is listed in /proc/meminfo. Pages in the cache are flushed to disk after an interval of 30 seconds. Pdflush is a set of kernel threads responsible for writing dirty pages to disk, either explicitly in response to a sync() call, or implicitly when the page cache runs out of pages, when pages have been in memory for too long, or when there are too many dirty pages in the page cache (as specified by /proc/sys/vm/dirty_ratio).
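The dirty-page count and the dirty_ratio threshold mentioned above can be inspected from the files in /proc. A minimal sketch (Linux-only; the helper name and the minimal error handling are my own, not a kernel API):

```c
/* Read values from /proc/meminfo and /proc/sys/vm/dirty_ratio.
 * Illustrative only; error handling is kept minimal. */
#include <stdio.h>
#include <string.h>

/* Return the value in kB reported for `key` in /proc/meminfo,
 * or -1 if the key is not found. */
long meminfo_kb(const char *key)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long val = -1;
    size_t klen = strlen(key);

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        /* Lines look like "Dirty:            1234 kB" */
        if (!strncmp(line, key, klen) && line[klen] == ':') {
            sscanf(line + klen + 1, "%ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

/* Return the dirty_ratio sysctl (percent), or -1 on error. */
int read_dirty_ratio(void)
{
    FILE *f = fopen("/proc/sys/vm/dirty_ratio", "r");
    int ratio = -1;

    if (f) {
        fscanf(f, "%d", &ratio);
        fclose(f);
    }
    return ratio;
}
```

For example, `meminfo_kb("Dirty")` returns the current amount of dirty memory in kilobytes, the same figure the article refers to in /proc/meminfo.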
At any given time, between two and eight pdflush threads are running in the system. The number of pdflush threads is determined by the load on the page cache: a new pdflush thread is spawned if no existing thread has been idle for more than one second and there is more work in the pdflush work queue. Conversely, if the last active pdflush thread has been asleep for more than one second, one thread is terminated; threads are terminated until only a minimum number remain. The current number of running pdflush threads is reflected in /proc/sys/vm/nr_pdflush_threads.

A number of pdflush-related issues have come to light over time. Pdflush threads are common to all block devices, but it is thought that they would perform better if they concentrated on a single disk spindle. Contention between pdflush threads is avoided through the use of the BDI_pdflush flag on the backing_dev_info structure, but this interlock can also limit writeback performance.

Another issue with pdflush is request starvation. There is a fixed number of I/O requests available for each queue in the system. If the limit is exceeded, any application requesting I/O will block waiting for a free slot. Since pdflush works on several queues, it cannot afford to block on a single queue, so it sets the wbc->nonblocking writeback flag instead. If other applications continue to write to the device, pdflush will repeatedly fail to allocate request slots; it can thus be starved of access to a queue it repeatedly finds congested.

Jens Axboe's patch set proposes a new approach: flusher threads per backing device info (BDI), as a replacement for pdflush threads. Unlike pdflush threads, per-BDI flusher threads focus on a single disk spindle. With per-BDI flushing, when the request_queue is congested, blocking happens on request allocation, avoiding request starvation and providing better fairness.
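The spawn/terminate policy described above can be modeled as plain logic. The structure and function names below are invented for illustration; this is not the kernel's actual implementation:

```c
/* Illustrative model of the pdflush thread-pool policy: spawn a
 * thread when no thread has been idle for over a second and work is
 * queued; terminate one when the last active thread has slept for
 * over a second, down to a minimum pool size. */
#include <stdbool.h>

#define MIN_PDFLUSH_THREADS 2
#define MAX_PDFLUSH_THREADS 8

struct pool_state {
    int nr_threads;       /* currently running flusher threads */
    int secs_all_busy;    /* time since any thread was last idle */
    int secs_all_idle;    /* time the last active thread has slept */
    int queued_work;      /* entries waiting in the work queue */
};

/* Returns +1 to spawn a thread, -1 to terminate one, 0 to do nothing. */
int pool_decision(const struct pool_state *s)
{
    bool can_grow   = s->nr_threads < MAX_PDFLUSH_THREADS;
    bool can_shrink = s->nr_threads > MIN_PDFLUSH_THREADS;

    if (can_grow && s->secs_all_busy > 1 && s->queued_work > 0)
        return +1;
    if (can_shrink && s->secs_all_idle > 1)
        return -1;
    return 0;
}
```

Note how the minimum of two threads is enforced: a fully idle pool at the minimum size simply does nothing, which matches the "two to eight threads" range described above.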
With pdflush, the dirty inode list is maintained by the superblock of the filesystem. Since a per-BDI flusher needs to know which dirty pages are to be written by its assigned device, this list is now kept by the BDI instead. Calls to flush dirty inodes on a superblock result in flushing the inodes from the dirty lists of all backing devices on which the filesystem resides. As with pdflush, per-BDI writeback is controlled through the writeback_control data structure, which instructs the writeback code what to do and how to perform the writeback.
The struct bdi_writeback keeps all of the information required for flushing the dirty pages (the field comments below are explanatory annotations, not part of the original listing):

    struct bdi_writeback {
        struct backing_dev_info *bdi;  /* our parent backing device */
        unsigned int nr;               /* this thread's number within the BDI */
        struct task_struct *task;      /* the flusher thread itself */
        wait_queue_head_t wait;        /* where the thread sleeps between runs */
        struct list_head b_dirty;      /* dirty inodes awaiting writeback */
        struct list_head b_io;         /* inodes queued for I/O */
        struct list_head b_more_io;    /* inodes with more I/O still pending */
        unsigned long nr_pages;        /* pages to write for the current request */
        struct super_block *sb;        /* superblock to restrict writeback to */
    };

The bdi_writeback structure is initialized when the device is registered through bdi_register().
nr_pages and sb are parameters passed asynchronously to the BDI flusher thread, and are not fixed through the life of the bdi_writeback. This is done to accommodate devices holding multiple filesystems, and hence multiple super_blocks; with multiple super_blocks on a single device, a sync can be requested for a single filesystem on the device.

The bdi_writeback_task() function waits for the dirty_writeback_interval, which by default is five seconds, and periodically initiates wb_do_writeback(wb). If no pages are written for five minutes, the flusher thread exits (with a grace period of dirty_writeback_interval). If writeback work is required after the thread has exited, a new flusher thread is spawned by the default writeback thread. Writeback flushes are done in two ways: explicitly, in response to a sync request, or periodically, when pages have been dirty for too long.
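The lifetime rules of the flusher thread can be sketched as follows. The constants mirror the defaults stated above (a 5-second dirty_writeback_interval and a 5-minute idle timeout), but the function names are invented for illustration and are not kernel code:

```c
/* Illustrative model of the per-BDI flusher thread's lifetime policy:
 * wake every dirty_writeback_interval, and exit after five idle
 * minutes.  If work appears later, the default writeback thread
 * re-spawns the flusher. */
#define DIRTY_WRITEBACK_INTERVAL_SECS 5
#define FLUSHER_IDLE_EXIT_SECS (5 * 60)

/* Given how long the thread has gone without writing any pages,
 * decide whether the flusher thread should exit. */
int flusher_should_exit(int idle_secs)
{
    return idle_secs >= FLUSHER_IDLE_EXIT_SECS;
}

/* Number of wakeups a flusher performs before exiting when it never
 * sees any work: the idle timeout divided by the polling interval. */
int idle_wakeups_before_exit(void)
{
    return FLUSHER_IDLE_EXIT_SECS / DIRTY_WRITEBACK_INTERVAL_SECS;
}
```

With the default values, an entirely idle flusher wakes 60 times before giving up its thread, which is why an unused block device does not keep a thread around forever.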
After review of the first attempt, Jens added the ability to have multiple flusher threads per device, based on suggestions from Andrew Morton. Dave Chinner suggested that filesystems would like to have a flusher thread per allocation group. In the second iteration of the patch set, Jens added a new interface in the superblock to return the bdi_writeback structure associated with an inode:

    struct bdi_writeback *(*inode_get_wb)(struct inode *);

If inode_get_wb is NULL, the default bdi_writeback of the BDI is returned, which means there is only one bdi_writeback thread for the BDI. The maximum number of threads that can be started per BDI is 32.

Initial experiments conducted by Jens found an 8% increase in performance on a simple SATA drive running the Flexible File System Benchmark (ffsb). File layout was smoother than with the vanilla kernel, as reported by vmstat, with a uniform distribution of buffers written out. With a ten-disk btrfs filesystem, per-BDI flushing performed 25% faster.

The writeback work is tracked in Jens's block layer git tree (git://git.kernel.dk/linux-2.6-block.git) under the "writeback" branch. There have been no comments on the second iteration so far, but per-BDI flusher threads are still not considered ready for the 2.6.30 tree.

Acknowledgments: Thanks to Jens Axboe for reviewing and explaining certain aspects of the patch set.
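The fallback behavior of the inode_get_wb hook can be sketched in heavily simplified form. The structures below are stand-ins invented for illustration, not the kernel's real definitions; only the dispatch logic (use the filesystem's hook if present, else the BDI's single default thread) reflects the patch set:

```c
/* Illustrative sketch of the proposed inode_get_wb dispatch: if the
 * filesystem does not supply the hook, fall back to the BDI's default
 * bdi_writeback.  All structures here are simplified stand-ins. */
#include <stddef.h>

struct inode;                      /* opaque for this sketch */
struct bdi_writeback { int id; };

struct backing_dev_info {
    struct bdi_writeback default_wb;   /* the one default flusher */
};

struct super_operations {
    /* New hook proposed in the second iteration of the patch set;
     * may be NULL. */
    struct bdi_writeback *(*inode_get_wb)(struct inode *);
};

/* Pick the writeback thread for an inode: the filesystem's choice if
 * the hook exists, otherwise the single default per-BDI thread. */
struct bdi_writeback *inode_to_wb(const struct super_operations *s_op,
                                  struct backing_dev_info *bdi,
                                  struct inode *inode)
{
    if (s_op->inode_get_wb)
        return s_op->inode_get_wb(inode);
    return &bdi->default_wb;
}
```

This is how a filesystem like XFS could route inodes from different allocation groups to different flusher threads, while filesystems that do not implement the hook keep exactly one flusher per BDI.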
Flushing out pdflush
Posted Apr 4, 2009 20:17 UTC (Sat) by mjcoder (subscriber, #54432) [Link]

> With a ten-disk btrfs filesystem, per-BDI flushing performed 25% faster

Wow, this is a massive improvement!
Flushing out pdflush
Posted Apr 4, 2009 20:59 UTC (Sat) by nix (subscriber, #2304) [Link]

Yeah, but how many new kernel threads do we end up with? We already have a crazy number.
Flushing out pdflush
Posted Apr 5, 2009 0:09 UTC (Sun) by knobunc (subscriber, #4678) [Link]

I thought that too... but what's the downside to more threads?
Flushing out pdflush
Posted Apr 5, 2009 11:56 UTC (Sun) by nix (subscriber, #2304) [Link]

Inelegance? 12K of unswappable kernel memory per thread?

A thread pool would surely be better (for something like this, anyway, in
Flushing out pdflush
Posted Apr 5, 2009 14:27 UTC (Sun) by i3839 (subscriber, #31386) [Link]

It seems that by default you will have one thread per block device, which seems totally reasonable, no matter how many block devices you have. Reducing the number below that means data will be written out more slowly, because you can't keep all devices busy. Except if you switch to a non-blocking multiplexing thread doing all write-outs, but there's probably a good reason why that's not done. For fast devices it may be better to increase the number of threads, but again, not doing that will result in slower write-out throughput.

A thread pool would only be better if you would otherwise allocate too many threads per device. But if you allocated too many, there's nothing preventing too high a number of threads in the pool either, so it's just shuffling the problem around.
Flushing out pdflush
Posted Apr 5, 2009 18:27 UTC (Sun) by nix (subscriber, #2304) [Link]

Most block devices are never or only very rarely written to (e.g. one containing only /usr is only going to be written to during a package upgrade). Why devote a thread to it which will be idle nearly all the time?
Flushing out pdflush
Posted Apr 6, 2009 11:44 UTC (Mon) by i3839 (subscriber, #31386) [Link]

The thread is only created when needed and exits after a time-out, IIRC.
Flushing out pdflush
Posted Apr 6, 2009 19:31 UTC (Mon) by nix (subscriber, #2304) [Link]

So it is, in effect, a thread pool. Excellent.
Flushing out pdflush
Posted Apr 17, 2009 12:41 UTC (Fri) by axboe (subscriber, #904) [Link]

Yes, that is correct: threads are created on demand and exit if they have been idle for some period of time. It's mostly to satisfy the people getting annoyed when looking at the ps output. An idle kernel thread is essentially free. Of course, if you have 10000 logical disks in your system, you'll probably appreciate not spending memory on the threads, at least.
Flushing out pdflush
Posted Apr 17, 2009 22:29 UTC (Fri) by nix (subscriber, #2304) [Link]

Last I heard, the set of PIDs was limited to 32767 by default, and having huge numbers of even idle processes tended to slow down the PID allocator horribly.

(Also, kernel threads still need a kernel stack: that's 8Kb of memory you
Flushing out pdflush
Posted Apr 17, 2009 23:43 UTC (Fri) by njs (subscriber, #40338) [Link]

The PID allocator supposedly got fixed a few years ago, around when NPTL landed. (Here's an interview with Ingo that confirms this: http://kerneltrap.org/node/517)

And if the kernel needed thousands of threads for some reason, presumably it could tweak the kernel.pid_max sysctl itself...

But anyway, yeah, for ordinary systems the memory usage matters a little, but not much.
Flushing out pdflush
Posted Apr 7, 2009 20:08 UTC (Tue) by im14u2c (subscriber, #5246) [Link]

Hmmm...

> At a given point of time, there are between two and eight pdflush threads running in the system.

vs.

> With a ten-disk btrfs filesystem, per-BDI flushing performed 25% faster

Could a noticeable part of the 25% boost be attributed to a 25% boost in the number of flushing threads? A 10-disk btrfs filesystem ought to be generating traffic on all 10 spindles, right?
Flushing out pdflush
Posted Apr 17, 2009 12:39 UTC (Fri) by axboe (subscriber, #904) [Link]

Actually, no. With btrfs, currently it assigns a per-fs backing device to each inode. So for that particular case, you have just a single BDI flusher thread running, even for the 10 disks.
