Public bug reported: Using Ubuntu Xenial, a user reports that processes hang in the "D" state waiting for disk I/O to complete.
Occasionally one of the applications gets into the "D" state on NFS read/sync and close system calls. Based on the kernel backtraces, the task appears to be stuck in a kmalloc allocation during cleanup of dirty NFS pages. All subsequent operations on the NFS mounts then hang, and a reboot is required to rectify the situation.

[Test scenario]
1) Applications running in a Docker environment
2) Applications have cgroup limits: --cpu-shares, --memory, -shm-limit
3) Python and C++ based applications (Torch and Caffe)
4) Applications read big LMDB files and write results to NFS shares
5) NFS v3 with the hard mount option; fscache is enabled
6) no swap space is configured

This causes all other I/O activity on that mount to hang. We are running into this issue more frequently and have identified a few applications that trigger it. As noted in the description, the problem seems to occur when exercising the stack containing try_to_free_mem_cgroup_pages+0xba/0x1a0. We see this with Docker containers started with the cgroup option --memory <USER_SPECIFIED_MEM>. Whenever there is a deadlock, the hung process has hit its cgroup memory limit many times; each time it typically cleans up dirty data and caches to bring usage back under the limit. This reclaim path is exercised many times, and eventually we probably hit a race and end up in a deadlock.
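A minimal sketch of how the memcg counters can be watched while the workload runs, to confirm that the hung task keeps bumping into its limit and forcing reclaim. This assumes the cgroup v1 layout used on Xenial, with Docker containers placed under /sys/fs/cgroup/memory/docker/; the container ID is a placeholder, and the script itself is only illustrative, not part of the reported workload:

    #!/usr/bin/env python
    # Illustrative diagnostic only: polls the memory cgroup counters of one
    # Docker container on a cgroup v1 host such as Xenial.
    # Replace <CONTAINER_ID> with the real container ID before running.
    import time

    CGROUP = "/sys/fs/cgroup/memory/docker/<CONTAINER_ID>"  # placeholder path

    def read_counter(name):
        # cgroup v1 exposes each counter as a single integer in its own file
        with open(CGROUP + "/" + name) as f:
            return int(f.read().strip())

    def main():
        limit = read_counter("memory.limit_in_bytes")
        while True:
            usage = read_counter("memory.usage_in_bytes")
            failcnt = read_counter("memory.failcnt")  # times the limit was hit
            print("usage=%d limit=%d failcnt=%d" % (usage, limit, failcnt))
            # A failcnt that keeps climbing while the task sits in "D" state
            # matches the repeated memcg reclaim described above.
            time.sleep(5)

    if __name__ == "__main__":
        main()

The same files can of course be read ad hoc with cat; the script only makes it easier to watch failcnt over time alongside the hung task.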
** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Dragan S. (dragan-s)
         Status: Incomplete

** Changed in: linux (Ubuntu)
    Milestone: None => xenial-updates

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Dragan S. (dragan-s)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1750038

Title:
  user space process hung in 'D' state waiting for disk io to complete

Status in linux package in Ubuntu:
  Incomplete