As the original poster: if I stopped posting OOM dumps, perhaps that is because things started to peter out on 4.8.0-39 or later.

What was particular about the load that triggered this bug was heavy IO putting cache pressure on ext4, on a system with zero locality of reference in anything read from or written to disk (SSD-backed storage). In any case, by May the data storage servers that had been triggering this issue had been decommissioned and the IO strategy had changed. Writes now go to a raw block device first and are periodically flushed to the filesystem using O_DSYNC, taking the ext4 disk cache out of the equation.

The HWE kernel is now 4.10 and, judging by the edge packages, soon to be 4.13, so maybe it's been fixed in that time. However, I'm no longer able to confirm or deny that, as I have no way to reproduce it. As per Rasmus' comment, it's something that only happened on production workloads.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1666260

Title:
  "Out of memory" errors after upgrade to 4.4.0-59 + 4.8.0-34

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Same as #1655842 - Started seeing oom-killer on multiple servers upgraded to 4.4.0-59.

  Unlike #1655842, also seeing the same oom-killer on multiple servers updated to 4.8.0-34.

  I first upgraded all the 4.8 servers to 4.8.0-36, then downgraded a few to 4.4.0-63. I am seeing an even more pronounced change in the memory usage, so I can only assume that 4.4.0-63 is also bugged with the same problem as 4.4.0-59 and 4.8.0-34. Either #1655842 is not fixed, or it is only fixed for certain kinds of workloads.

  These are the changes I'm seeing in our memory graphs between 4.4.0-59 and 4.4.0-63/4.8.0-34.
  The symptoms I'm seeing are:

  Upgrading 4.4.0-57 -> 4.4.0-59:
  - /proc/meminfo:Buffers: Up from 9GB to 15GB
  - /proc/meminfo:Cached: Up from 5GB to 10GB
  - /proc/meminfo:SReclaimable: Down from 15GB to 5GB
  - /proc/meminfo:SUnreclaim: Staying at 50MB

  Upgrading 4.4.0-57 -> 4.4.0-63:
  - /proc/meminfo:Buffers: Up from 9GB to 26GB
  - /proc/meminfo:Cached: Down from 5GB to 300MB
  - /proc/meminfo:SReclaimable: Down from 15GB to 2GB
  - /proc/meminfo:SUnreclaim: Down from 50MB to 30MB

  Upgrading 4.4.0-57 -> 4.8.0-34:
  - /proc/meminfo:Buffers: Up from 9GB to 14GB
  - /proc/meminfo:Cached: Down from 5GB to 2GB
  - /proc/meminfo:SReclaimable: Down from 15GB to 14GB
  - /proc/meminfo:SUnreclaim: Staying at 50MB

  Setting vm.vfs_cache_pressure = 300 seems to have a positive effect of not causing OOMs. Downgrading to 4.4.0-57 also works.

  Will also note that I haven't had a definitive OOM in 4.4.0-63. But the shift in memory usage is far too much from what I expect to be normal on these particular servers where I'm experiencing crashes.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.4.0-63-generic 4.4.0-63.84
  ProcVersionSignature: Ubuntu 4.4.0-63.84-generic 4.4.44
  Uname: Linux 4.4.0-63-generic x86_64
  AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.1-0ubuntu2.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  Date: Mon Feb 20 16:15:56 2017
  InstallationDate: Installed on 2012-06-04 (1721 days ago)
  InstallationMedia:
  IwConfig:
   lo        no wireless extensions.
   eth0      no wireless extensions.
  Lsusb: Error: [Errno 2] No such file or directory: 'lsusb'
  MachineType: System manufacturer System Product Name
  PciMultimedia:
  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-63-generic root=UUID=b790930f-ad81-4b27-a353-a4b3d6a29007 ro nomodeset nomdmonddf nomdmonisw
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-63-generic N/A
   linux-backports-modules-4.4.0-63-generic  N/A
   linux-firmware                            1.157.8
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to xenial on 2017-02-16 (4 days ago)
  dmi.bios.date: 10/17/2011
  dmi.bios.vendor: American Megatrends Inc.
  dmi.bios.version: 1106
  dmi.board.asset.tag: To be filled by O.E.M.
  dmi.board.name: P8H67-M PRO
  dmi.board.vendor: ASUSTeK Computer INC.
  dmi.board.version: Rev 1.xx
  dmi.chassis.asset.tag: Asset-1234567890
  dmi.chassis.type: 3
  dmi.chassis.vendor: Chassis Manufacture
  dmi.chassis.version: Chassis Version
  dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1106:bd10/17/2011:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP8H67-MPRO:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
  dmi.product.name: System Product Name
  dmi.product.version: System Version
  dmi.sys.vendor: System manufacturer
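
For anyone who wants to watch the same four /proc/meminfo fields compared in the bug description (Buffers, Cached, SReclaimable, SUnreclaim), here is a minimal Python sketch. It is not the tooling that produced the memory graphs mentioned above; the function names `parse_meminfo` and `slab_and_cache_summary` are made up for illustration, and it assumes the selected fields are reported in kB, as they are on the kernels discussed here.

```python
from pathlib import Path

# The four fields compared in the bug description.
FIELDS = ("Buffers", "Cached", "SReclaimable", "SUnreclaim")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most values carry a "kB" suffix; a few (e.g. HugePages_Total) are
    plain counts. The unit is dropped and only the number is kept.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def slab_and_cache_summary(text, fields=FIELDS):
    """Return the selected fields converted from kB to GiB.

    Fields missing from the input are skipped rather than raising.
    """
    info = parse_meminfo(text)
    return {f: info[f] / (1024 * 1024) for f in fields if f in info}


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # only present on Linux
        summary = slab_and_cache_summary(meminfo.read_text())
        for field, gib in summary.items():
            print(f"{field:<14}{gib:8.2f} GiB")
```

Running this before and after a kernel change, under a comparable workload, gives numbers that can be compared directly with the ones listed in the description.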