As the original poster: I stopped posting OOM dumps, so perhaps the
problem started to peter out on 4.8.0-39 or later.

What was particular about the load that triggered this bug was heavy IO
putting cache pressure on ext4, on a system with zero locality of
reference in anything read from or written to disk (SSD-backed
storage).

In any case, by May the data storage servers that had been triggering
this issue had been decommissioned and the IO strategy had changed.  Now
writes go to a raw block device and are periodically flushed to the
filesystem using O_DSYNC, taking the ext4 disk cache out of the
equation.
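
As a rough illustration of the flush step only, here is a minimal Python
sketch of writing a buffer to a file opened with O_DSYNC, so that each
write returns only after the data has reached stable storage. The
function name write_dsync and the target path are hypothetical; this is
not the code that ran on these servers, and the raw block device staging
is not shown.

```python
import os


def write_dsync(path, data):
    """Write data to path with O_DSYNC and return the number of bytes written.

    Hypothetical helper: with O_DSYNC, each write() returns only after the
    data has been transferred to the underlying storage.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < len(view):
            written += os.write(fd, view[written:])
        return written
    finally:
        os.close(fd)
```

os.O_DSYNC is only defined on platforms that support it (Linux does).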

The HWE kernel is now 4.10 and, judging by the edge packages, soon to be
4.13, so maybe it's been fixed in that time.  However, I'm no longer able
to confirm or deny that, as there's no way for me to reproduce it.  As
per Rasmus' comment, it's something that only happened on production
workloads.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1666260

Title:
  "Out of memory" errors after upgrade to 4.4.0-59 + 4.8.0-34

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Same as #1655842 - Started seeing oom-killer on multiple servers
  upgraded to 4.4.0-59.

  Unlike #1655842, also seeing the same oom-killer on multiple servers
  updated to 4.8.0-34.

  First upgraded all the 4.8 servers to 4.8.0-36, then downgraded a few to
  4.4.0-63.  I am seeing an even more pronounced change in the memory
  usage, so I can only assume that 4.4.0-63 is also bugged with the same
  problem as 4.4.0-59 and 4.8.0-34.  Either because #1655842 is not
  fixed, or it is only fixed for certain kinds of workloads.

  These are the changes I'm seeing in our memory graphs between 4.4.0-59
  and 4.4.0-63/4.8.0-34.

  The symptoms I'm seeing are:

  Upgrading 4.4.0-57 -> 4.4.0-59:
  - /proc/meminfo:Buffers: Up from 9GB to 15GB
  - /proc/meminfo:Cached: Up from 5GB to 10GB
  - /proc/meminfo:SReclaimable: Down from 15GB to 5GB
  - /proc/meminfo:SUnreclaim: Staying at 50MB

  Upgrading 4.4.0-57 -> 4.4.0-63:
  - /proc/meminfo:Buffers: Up from 9GB to 26GB
  - /proc/meminfo:Cached: Down from 5GB to 300MB
  - /proc/meminfo:SReclaimable: Down from 15GB to 2GB
  - /proc/meminfo:SUnreclaim: Down from 50MB to 30MB

  Upgrading 4.4.0-57 -> 4.8.0-34:
  - /proc/meminfo:Buffers: Up from 9GB to 14GB
  - /proc/meminfo:Cached: Down from 5GB to 2GB
  - /proc/meminfo:SReclaimable: Down from 15GB to 14GB
  - /proc/meminfo:SUnreclaim: Staying at 50MB
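
The four fields compared above can be pulled out of /proc/meminfo with a
few lines of Python. This is only a sketch for tracking the same numbers
over time; the function names are made up for illustration, and it
assumes the usual "Name:   value kB" layout of that file.

```python
FIELDS = ("Buffers", "Cached", "SReclaimable", "SUnreclaim")


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text.

    Most values are in kB; a few counters (e.g. HugePages_Total) carry
    no unit and are returned as plain integers.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def summarize(text):
    """Return the four fields of interest converted from kB to GiB."""
    info = parse_meminfo(text)
    return {f: info[f] / (1024 * 1024) for f in FIELDS if f in info}


def read_proc_meminfo(path="/proc/meminfo"):
    """Summarize the live values on a Linux system."""
    with open(path) as fh:
        return summarize(fh.read())
```

Calling read_proc_meminfo() before and after a kernel change gives
numbers that can be compared directly with the lists above.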

  Setting vm.vfs_cache_pressure = 300 seems to have a positive effect of
  not causing OOMs.
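
For reference, the setting can be read and changed through
/proc/sys/vm/vfs_cache_pressure. The helpers below are a hypothetical
sketch (the names are not from any tool used here); writing needs root,
and the change does not survive a reboot unless it is also put in the
sysctl configuration. The kernel default is 100, and higher values make
it more willing to reclaim dentry and inode caches.

```python
SYSCTL_PATH = "/proc/sys/vm/vfs_cache_pressure"


def get_cache_pressure(path=SYSCTL_PATH):
    """Return the current vm.vfs_cache_pressure value as an int."""
    with open(path) as fh:
        return int(fh.read().strip())


def set_cache_pressure(value, path=SYSCTL_PATH):
    """Set vm.vfs_cache_pressure (requires root on the real sysctl file)."""
    if value < 0:
        raise ValueError("vfs_cache_pressure must be non-negative")
    with open(path, "w") as fh:
        fh.write(f"{value}\n")
```

For example, set_cache_pressure(300) applies the value mentioned above.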

  Downgrading to 4.4.0-57 also works.

  Will also note that I haven't had a definitive OOM in 4.4.0-63.  But
  the shift in memory usage is far too much from what I expect to be
  normal on these particular servers where I'm experiencing crashes.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.4.0-63-generic 4.4.0-63.84
  ProcVersionSignature: Ubuntu 4.4.0-63.84-generic 4.4.44
  Uname: Linux 4.4.0-63-generic x86_64
  AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.1-0ubuntu2.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  Date: Mon Feb 20 16:15:56 2017
  InstallationDate: Installed on 2012-06-04 (1721 days ago)
  InstallationMedia:

  IwConfig:
   lo        no wireless extensions.

   eth0      no wireless extensions.
  Lsusb: Error: [Errno 2] No such file or directory: 'lsusb'
  MachineType: System manufacturer System Product Name
  PciMultimedia:

  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-63-generic root=UUID=b790930f-ad81-4b27-a353-a4b3d6a29007 ro nomodeset nomdmonddf nomdmonisw
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-63-generic N/A
   linux-backports-modules-4.4.0-63-generic  N/A
   linux-firmware                            1.157.8
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to xenial on 2017-02-16 (4 days ago)
  dmi.bios.date: 10/17/2011
  dmi.bios.vendor: American Megatrends Inc.
  dmi.bios.version: 1106
  dmi.board.asset.tag: To be filled by O.E.M.
  dmi.board.name: P8H67-M PRO
  dmi.board.vendor: ASUSTeK Computer INC.
  dmi.board.version: Rev 1.xx
  dmi.chassis.asset.tag: Asset-1234567890
  dmi.chassis.type: 3
  dmi.chassis.vendor: Chassis Manufacture
  dmi.chassis.version: Chassis Version
  dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1106:bd10/17/2011:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP8H67-MPRO:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
  dmi.product.name: System Product Name
  dmi.product.version: System Version
  dmi.sys.vendor: System manufacturer

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1666260/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
