On Thu, 29 Dec 2011 01:25:36 +0100, Samuel Thibault <samuel.thiba...@gnu.org> wrote:
> It might be biaised by the clock not being able to tick everywhere in
> the kernel (though I guess e.g. most of the IPC machinery is running
> at ipl0?), but I believe it's still a bit revealing: I had already
> noticed that ext2fs spends most of its time in the kernel (like 90%),
> and it here seems we're spending a lot of time just managing the
> ext2fs thread sleeps (no, there aren't many threads in that test,
> just two dozen).

It's good to see some real numbers, thanks Samuel.

In fact, what on other systems is a (relatively) simple and straightforward operation (like read()/write()) is for us (too?) complex, and can potentially involve several context switches.

> while true; do rm -f blop ; \cp -f blip blop ; done

In this example, "blip" is cached after the first run, so the subsequent io_read requests can be served without blocking. But with "blop", for each io_write:

1) Eventually the thread handling the io_write reaches pager_memcpy. When it tries to touch the first page of the destination object, it faults, enters the kernel, sends a m_o_data_request message, and waits for the page fault to be resolved at vm_fault_continue.

2) Another thread receives the m_o_data_request and serves it.

3) The first thread briefly continues, sends a m_o_data_unlock message, and again sits waiting at vm_fault_continue.

4) Another thread receives the m_o_data_unlock and serves it.

5) The first thread continues, returns to user space, copies the data, and faults on the next page, returning to 2.

6) When there is no more data to copy, the first thread exits pager_memcpy and replies to its client.

But at this point no data has actually been written to disk. So we also have the synchronization interval, which:

1) Iterates over all active pagers, locks them, and requests the return of all their dirty pages.

2) At the same time, a bunch of m_o_data_return messages are being received (at an arbitrary thread) as a consequence of 1). Those messages require locking the pager (again) and generate a lot of I/O (page by page).

And we also have the "rm", which terminates the object, which in turn makes it compete with the synchronization interval in generating those m_o_data_return messages.

We have plenty of reasons not to be happy with this approach. In fact, in addition to poor performance, we suffer other problems that can be directly traced to it, like thread storms, erratic pageouts, bad cache utilization, and multiple headaches dealing with locks :-)

In memfs (which is still WIP, but able to do simple operations) I'm trying a different strategy: pagers are only used for mmap'ed objects, and simple read/write operations to the backing store serve the rest of the requests. This would require some kind of cache in libstore for the rest of the filesystems, but it's not a problem for memfs, as it only works with memory.
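
Just to illustrate the contrast, here is a minimal sketch of that strategy (this is not the actual memfs code; memfs_node, its data/size fields and memfs_read are made-up names for illustration): io_read becomes a plain memcpy from the node's in-memory contents under the node lock, with no fault / m_o_data_request / m_o_data_unlock round trips, and the pager is only kept around for mmap clients.

    /* Sketch only: serve a read straight from the in-memory contents,
       bypassing the pager entirely.  */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>

    struct memfs_node
    {
      char *data;    /* in-memory contents (the "backing store") */
      size_t size;   /* current file size */
    };

    /* Copy up to *AMOUNT bytes starting at OFFSET into BUF, and return
       the actual amount copied in *AMOUNT.  Caller holds the node lock.  */
    static error_t
    memfs_read (struct memfs_node *n, off_t offset, void *buf, size_t *amount)
    {
      if (offset < 0)
        return EINVAL;
      if ((size_t) offset >= n->size)
        {
          *amount = 0;    /* EOF */
          return 0;
        }
      if (*amount > n->size - (size_t) offset)
        *amount = n->size - (size_t) offset;
      memcpy (buf, n->data + offset, *amount);
      return 0;
    }

For a disk-backed translator the memcpy would instead turn into something like libstore's store_read/store_write, which is exactly where the caching layer mentioned above would become necessary.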