On Thu, Jun 14, 2012 at 12:19:24PM +0200, Mark Kettenis wrote: > > Date: Thu, 14 Jun 2012 08:58:25 +0200 > > From: Alexandre Ratchov <a...@caoua.org> > > > > On Thu, Jun 14, 2012 at 03:48:12AM +0200, Ariane van der Steldt wrote: > > > Hi, > > > > > > This diff implements yielding in the vm system, with the intention to > > > make expensive system calls preempt every once in a while, to allow > > > userland system calls to progress. > > > > > > It's a very conservative application at the moment, only interleaving > > > during memory unmapping. If this works, we can start adding the > > > mechanic to other expensive system calls not just in uvm, but anywhere > > > the kernel is kept locked for long stretches of time. > > > > > > > FWIW, few years ago I did a lot of tests and munmap() was by far > > the longest kernel code path. IIRC the second was ffs w/ softdeps. > > > > > Technically the diff is incomplete, since it does not consider the very > > > big problem of the biglock. And that's because the biglock lacks a > > > __mp_lock_is_contended() call: if we hold the biglock and another cpu > > > wants the biglock, ideally it will inform us so we can sidestep and > > > temporarily grant access. > > > > I've read the diff, but I still don't understand this part. Why > > wouldn't the naive approch work for MP as well? I mean, on UP > > systems we do: > > > > for (;;) { > > one_cycle_of_heavy_stuff(); > > preemp(); > > } > > > > and preemp() would check for a flag set in the clock interrupt and > > possibly call mi_switch() to run another process. On MP systems, > > mi_switch() would release the kernel lock, wouldn't it? If so, > > other processes can invoke syscalls. In this case I don't see the > > difference between MP and UP. > > Indeed. Except that we probably should yield() instead of preempt() > since this is a voluntary context switch. At least that is what the > pool subsystem does. > > The hard bit is to determine how much work to do in one cycle. Should > be enough work to still spend a significant fraction of a timeslice > otherwise unmapping stuff becomes too slow.
Yes, the amount of work may not be too small. I've considered this and decided to solve it by calling the preempt conditionally: only context switch if: - the process is told a context switch needs to happen - we're the reaper and something else on this cpu wants to run This basically keeps us from switching unless there's a good reason, like curproc having used up its entire time slice or a higher priority process became runnable (the former happening only every 100ms, the latter happening only a limited amount of time: worst case every process will become blocked on curproc and curproc completes, unblocking them all). If neither is true, we don't preempt (thus not calling mpswitch(), not releasing the kernel lock etc etc). For this reason it doesn't work well on SMP. Unfortunately the above conditions may be false if for ex 2 processes are spread out across 2 cpus: the cpu with the biglock will not mark the proc as need-to-switch (because it's the only process on the runqueue, iirc there's an optimization for that). And thus we don't switch and never detect that the other cpu is hopelessly spinning because the other proc wants to enter a system call. That's why I want to detect mplock contention: that would take care of other cpus needing to enter kernel. -- Ariane