Please disregard my previous analysis. After carefully disassembling the java
process dump, I could see that the instruction pointer was at:

    if (iter.task)
        put_task_struct(iter.task);

This is a few instructions before the loop I was referring to earlier as a
potential deadlock candidate.

   0xffffffff811df242 <+34>:    je     0xffffffff811df24f <next_tgid+47>
   0xffffffff811df244 <+36>:    lock decl 0x10(%rdx)
   0xffffffff811df248 <+40>:    sete   %al
   0xffffffff811df24b <+43>:    test   %al,%al

The combination of those 3 instructions (decl, sete & test) shows me that this
is actually the implementation of the atomic_dec_and_test() function. It serves
to atomically (with CPU coordination) decrement a reference counter in the
task_struct for the given task.

Deadlock is probably NOT there.

I then went to the khungtaskd stack to check which task it was complaining
about being blocked, and I could not find it there (the task ID was probably
in registers rather than on the stack), BUT I could find an output message
telling me which task was blocked:

[15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds.

This message comes straight from the khungtaskd code. It means that PID 57263 was NOT
scheduled for more than 120 seconds. Checking the task:

    PID: 57263
COMMAND: "jsvc"
   TASK: ffff883e0c4e2e00  [THREAD_INFO: ffff883e849c2000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE|TASK_TRACED|EXIT_DEAD

I can see that it is in the UNINTERRUPTIBLE state (it cannot receive signals,
for example).
Checking its stack:

crash> bt
PID: 57263  TASK: ffff883e0c4e2e00  CPU: 1   COMMAND: "jsvc"
 #0 [ffff883e849c3c88] __schedule at ffffffff8166256a
 #1 [ffff883e849c3d10] schedule at ffffffff81662c0f
 #2 [ffff883e849c3d20] schedule_timeout at ffffffff8166324d
 #3 [ffff883e849c3dd0] wait_for_common at ffffffff81662a4f
 #4 [ffff883e849c3e70] wait_for_completion at ffffffff81662bcd
 #5 [ffff883e849c3e80] flush_work at ffffffff8108757e
 #6 [ffff883e849c3ed0] schedule_on_each_cpu at ffffffff81087813
 #7 [ffff883e849c3f10] lru_add_drain_all at ffffffff81127365
 #8 [ffff883e849c3f20] sys_fadvise64_64 at ffffffff8111e189
 #9 [ffff883e849c3f70] sys_fadvise64 at ffffffff8111e27e
#10 [ffff883e849c3f80] system_call_fastpath at ffffffff8166d2c2

## sys_fadvise64_64

The task entered kernel mode through the fadvise syscall. fadvise basically
tells the kernel how a file is going to be accessed so it can prepare the
appropriate caching strategy (expect page references in random order, expect
page references in sequential order, expect a delay before accessing file
contents, etc.).

It turns out that the JSVC process basically told the kernel that it would NOT
need the opened file (flag POSIX_FADV_DONTNEED) for a given range, starting at
a certain offset. With that, the kernel calculated the start/end page indexes
(offset, offset+len) for the memory range of the given file and tried to
invalidate all unlocked pages (page cache) in that area. Dirty, locked,
under-writeback, or mapped pages are NOT invalidated.

After the first attempt at invalidating pages from the page cache for that
particular file, the kernel found that fewer pages than expected (compared to
the range given to fadvise in the form of the offset/len delta) were
invalidated (in other words, it could not invalidate all pages in the range),
so it resorted to calling the LRU DRAIN logic.

## lru_add_drain_all

LRU stands for the Least Recently Used algorithm; it is used for user-mode
address space and for the page cache (this case). It maintains 2 lists of
memory pages: active and inactive. The LRU DRAIN logic schedules a drain
function on every CPU in the system in order to drain the per-CPU page
vectors (pagevecs).

Pagevecs are small, fixed-size temporary data structures containing page
descriptor pointers. Instead of changing the LRU lists (active/inactive)
directly, the LRU code batches pages into per-CPU page vectors until they are
full, avoiding having to lock/unlock the LRU lists as often.

## schedule_on_each_cpu

Task ffff883e0c4e2e00 continued the LRU DRAIN logic by scheduling its callback
function with schedule_on_each_cpu:

schedule_on_each_cpu(drain_cpu_pagevecs) scheduled drain_cpu_pagevecs on all
CPUs and kept waiting, for more than 120 seconds, for every CPU to execute
the drain.

While it was waiting, task ffff883e0c4e2e00 was put in the UNINTERRUPTIBLE
state and remained stuck in the wait loop until drain_cpu_pagevecs finished
on all CPUs (which it never did).

## Precise commit (cherry-picked from upstream) adding that logic

commit a68692ea42825466b0ade8a94a98afc924d2564e
Author: Mel Gorman <mgor...@suse.de>
Date:   Fri Feb 22 16:35:59 2013 -0800

    mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages

    BugLink: http://bugs.launchpad.net/bugs/1150557

    commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 upstream.

It was added to Precise (3.2.0-40.64).

I'm removing that logic and providing a kernel without it to see if it
mitigates the issue.

Per the original commit description, this change was supposed to fix erratic
behaviour seen on the s390x architecture with the sys_fadvise64_64 system
call. I'm afraid it introduced a race condition, one that may have
disappeared upstream because of code refactoring in later commits. If not,
I'll check whether the logic is still warranted upstream and discuss it
there appropriately.

I'll provide a hotfixed kernel which I would like to get feedback about.

Thank you

Rafael

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1534413

Title:
  Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED

Status in linux package in Ubuntu:
  In Progress

Bug description:
  It was brought to my knowledge a kernel dump (3.2.0-79) with the
  following stack trace:

  """
  [14277.952072] usb 3-1: USB disconnect, device number 3
  [15388.602790] NOHZ: local_softirq_pending 02
  [15404.593795] NOHZ: local_softirq_pending 02
  [15436.575787] NOHZ: local_softirq_pending 02
  [15452.566802] NOHZ: local_softirq_pending 02
  [15456.564528] NOHZ: local_softirq_pending 02
  [15564.503842] NOHZ: local_softirq_pending 02
  [15584.492538] NOHZ: local_softirq_pending 02
  [15588.490302] NOHZ: local_softirq_pending 02
  [15632.465563] NOHZ: local_softirq_pending 02
  [15659.014629] NOHZ: local_softirq_pending 02
  [15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds.
  [15956.375347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [15956.383487] jsvc            D ffffffff81806200     0 57263   9495 0x00000000
  [15956.383493]  ffff883e849c3d08 0000000000000082 ffff883e849c3ca8 000000008104f624
  [15956.383502]  ffff883e849c3fd8 ffff883e849c3fd8 ffff883e849c3fd8 0000000000012800
  [15956.383509]  ffff881f72db9700 ffff883e0c4e2e00 ffff883e849c3cf8 7fffffffffffffff
  [15956.383518] Call Trace:
  [15956.383534]  [<ffffffff81662c0f>] schedule+0x3f/0x60
  [15956.383539]  [<ffffffff8166324d>] schedule_timeout+0x29d/0x310
  [15956.383547]  [<ffffffff810388be>] ? physflat_send_IPI_mask+0xe/0x10
  [15956.383554]  [<ffffffff81032068>] ? native_smp_send_reschedule+0x48/0x60
  [15956.383560]  [<ffffffff8103ec29>] ? default_spin_lock_flags+0x9/0x10
  [15956.383564]  [<ffffffff81662a4f>] wait_for_common+0xdf/0x180
  [15956.383572]  [<ffffffff81060ac0>] ? try_to_wake_up+0x200/0x200
  [15956.383576]  [<ffffffff81662bcd>] wait_for_completion+0x1d/0x20
  [15956.383585]  [<ffffffff8108757e>] flush_work+0x2e/0x40
  [15956.383589]  [<ffffffff810838b0>] ? wake_up_worker+0x30/0x30
  [15956.383593]  [<ffffffff81087813>] schedule_on_each_cpu+0xc3/0x110
  [15956.383602]  [<ffffffff81127365>] lru_add_drain_all+0x15/0x20
  [15956.383607]  [<ffffffff8111e189>] sys_fadvise64_64+0x189/0x270
  [15956.383610]  [<ffffffff8111e27e>] sys_fadvise64+0xe/0x10
  [15956.383619]  [<ffffffff8166d2c2>] system_call_fastpath+0x16/0x1b
  [15956.383622] Kernel panic - not syncing: hung_task: blocked tasks
  [15956.388083] Pid: 178, comm: khungtaskd Tainted: G        W    3.2.0-79-generic #115-Ubuntu
  [15956.397273] Call Trace:
  [15956.401783]  [<ffffffff8164c005>] panic+0x91/0x1a4
  [15956.406527]  [<ffffffff810d97c2>] check_hung_task+0xb2/0xc0
  [15956.411393]  [<ffffffff810d98eb>] check_hung_uninterruptible_tasks+0x11b/0x140
  [15956.421117]  [<ffffffff810d9910>] ? check_hung_uninterruptible_tasks+0x140/0x140
  [15956.431847]  [<ffffffff810d995f>] watchdog+0x4f/0x60
  [15956.437524]  [<ffffffff8108b99c>] kthread+0x8c/0xa0
  [15956.443145]  [<ffffffff8166f434>] kernel_thread_helper+0x4/0x10
  [15956.448830]  [<ffffffff8108b910>] ? flush_kthread_worker+0xa0/0xa0
  [15956.454700]  [<ffffffff8166f430>] ? gs_change+0x13/0x13
  """

  Analysis being made on the comments...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions
