On Sun 16-07-17 19:59:51, Tetsuo Handa wrote:
> Since the memory reclaim path was never designed to handle scheduling
> priority inversions, locations which assume that some code path will
> eventually complete without using synchronization mechanisms can get
> stuck (livelock), because CPU time is not guaranteed to be yielded to
> the thread executing that code path.
> 
> mutex_trylock() in __alloc_pages_may_oom() (waiting for oom_lock) versus
> schedule_timeout_killable(1) in out_of_memory() (with oom_lock already
> held) is one such location. It was demonstrated with an artificial
> stressor that the system gets stuck effectively forever, because a
> SCHED_IDLE priority thread is unable to resume execution at
> schedule_timeout_killable(1) while many !SCHED_IDLE priority threads
> are wasting CPU time [1].
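> 
> As an illustration only (a hypothetical userspace sketch, not the actual
> reproducer from [1]), the starvation pattern can be mimicked with one
> SCHED_IDLE lock holder and a crowd of SCHED_OTHER trylock spinners:
> 
> #define _GNU_SOURCE
> #include <pthread.h>
> #include <sched.h>
> #include <stdio.h>
> #include <unistd.h>
> 
> static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> 
> /* Stands in for the oom_lock holder sleeping in out_of_memory(). */
> static void *holder(void *unused)
> {
> 	struct sched_param sp = { .sched_priority = 0 };
> 
> 	pthread_setschedparam(pthread_self(), SCHED_IDLE, &sp);
> 	pthread_mutex_lock(&lock);
> 	for (;;) {
> 		usleep(1000); /* like schedule_timeout_killable(1) */
> 		fputs("holder made progress\n", stderr);
> 	}
> 	return NULL;
> }
> 
> /* Stands in for the mutex_trylock() retry loop of a reclaim storm:
>  * burns CPU, never blocks. */
> static void *spinner(void *unused)
> {
> 	for (;;) {
> 		if (pthread_mutex_trylock(&lock) == 0)
> 			pthread_mutex_unlock(&lock);
> 	}
> 	return NULL;
> }
> 
> int main(void)
> {
> 	pthread_t t;
> 	long i, cpus = sysconf(_SC_NPROCESSORS_ONLN);
> 
> 	pthread_create(&t, NULL, holder, NULL);
> 	sleep(1); /* let the holder grab the lock first */
> 	for (i = 0; i < 10 * cpus; i++)
> 		pthread_create(&t, NULL, spinner, NULL);
> 	pause(); /* "holder made progress" appears rarely, if at all */
> 	return 0;
> }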
> 
> Solving this problem properly would require a complete redesign and
> rewrite of the whole memory reclaim path, but we are not going to
> reimplement the whole stack (at least for the foreseeable future).
> 
> Thus, this patch works around the livelock by using mutex_lock_killable(),
> which forcibly yields enough CPU time to the thread holding oom_lock so
> that the OOM killer/reaper can make use of it. Of course, this patch does
> not help if the lack of CPU time has some other cause (e.g. CPU intensive
> computation running at very high scheduling priority), but that is not
> this patch's fault. This patch only avoids the lockup when the lack of
> CPU time is caused by a direct reclaim storm wasting CPU time without
> making any progress while waiting for oom_lock.
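> 
> For comparison, the userspace analogue of this change is to make the
> spinner() from the sketch above block instead of spinning, so that
> waiters sleep and the SCHED_IDLE holder gets CPU time (again a
> hypothetical sketch, not kernel code):
> 
> /* Waiters sleep in pthread_mutex_lock() instead of burning CPU in a
>  * trylock loop, just as mutex_lock_killable() makes the allocating
>  * threads sleep while waiting for oom_lock. */
> static void *waiter(void *unused)
> {
> 	for (;;) {
> 		pthread_mutex_lock(&lock);
> 		pthread_mutex_unlock(&lock);
> 	}
> 	return NULL;
> }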

I have to think about this some more. Hitting the oom_lock much more
often is a problem while __oom_reap_task_mm still depends on the oom_lock.
With http://lkml.kernel.org/r/[email protected] it
no longer does.

Also this whole reasoning is a little bit dubious to me. The whole reclaim
stack might still preempt the holder of the lock, so you are addressing
only a very specific contention case where everybody hits the OOM path. I
suspect that a differently constructed testcase might result in the same
problem.
 
> [1] http://lkml.kernel.org/r/[email protected]
> 
> Signed-off-by: Tetsuo Handa <[email protected]>
> ---
>  mm/page_alloc.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb..622ecbf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3259,10 +3259,12 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>       *did_some_progress = 0;
>  
>       /*
> -      * Acquire the oom lock.  If that fails, somebody else is
> -      * making progress for us.
> +      * Acquire the oom lock. If that fails, somebody else should be making
> +      * progress for us. But if many threads are doing the same thing, the
> +      * owner of the oom lock can fail to make progress due to lack of CPU
> +      * time. Therefore, wait unless we get SIGKILL.
>        */
> -     if (!mutex_trylock(&oom_lock)) {
> +     if (mutex_lock_killable(&oom_lock)) {
>               *did_some_progress = 1;
>               schedule_timeout_uninterruptible(1);
>               return NULL;
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs
