Ok, thx for the info.
I've arranged to get both in and submitted the request to the kernel team's
mailing list.
(The kernel team will notice that one is now stable and get it that way ...)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1975582

Title:
  [UBUNTU 20.04] rcu stalls with many storage key guests

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  In Progress
Status in linux source package in Impish:
  In Progress
Status in linux source package in Jammy:
  In Progress

Bug description:
  SRU Justification:
  ==================

  [Impact]

   * Ubuntu on s390x KVM environments with lots of large guests with storage
     keys can be affected by rcu stalls.

   * These rcu stalls can cause the system to crash/dump.

  [Fix]

   * 3ae11dbcfac9 3ae11dbcfac906a8c3a480e98660a823130dc16a "s390/mm: use
  non-quiescing sske for KVM switch to keyed guest"

   * 6d5946274df1 6d5946274df1fff539a7eece458a43be733d1db8 "s390/gmap:
  voluntarily schedule during key setting"

  [Test Plan]

   * The problem situation cannot be triggered, tested directly or
     re-created on demand, but...

   * an IBM z13 or LinuxONE (or newer) LPAR is needed that
     runs Ubuntu Server 20.04 LTS or 18.04 LTS with HWE kernel
     and acts as KVM host, again with several large guests running
     on top with storage keys.

   * Let such a system run for days under significant load
     and watch the logs for rcu issues.

   * Prior to the submission of this SRU, patched test kernels
     for focal 5.4 and bionic hwe-5.4 were created and tested.
     They ran for days in a staging environment at IBM
     without further issues.

   * The modifications are all limited to s390x.

   * A test kernel was built (see below) that ran in a test environment
     at IBM under appropriate load for several days.
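
   Watching the logs for rcu issues could be partly automated. The
   following is only a minimal sketch, not part of the SRU: the helper
   name find_rcu_stalls is hypothetical, and the pattern is taken from
   the single stall message quoted in this bug, so other kernels or
   other stall variants may be worded differently and would be missed.

```python
import re

# Assumption: matches only the "self-detected stall" wording quoted in this
# bug report; it is not a complete list of RCU stall messages.
STALL_RE = re.compile(r"rcu: INFO: \S+ self-detected stall on CPU")


def find_rcu_stalls(lines):
    """Return (line_number, text) pairs for lines that look like RCU stall reports."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if STALL_RE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

   The function accepts any iterable of log lines, for example an open
   file object for a saved kernel log, and returns an empty list when
   nothing matches.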

  [Where problems could occur]

   * Due to the change for the KVM switch to keyed guest
     from classic sske to non-quiescing sske,
     the KVM behaviour might change and the storage keys might be harmed.

   * The now more generous scheduling while setting keys
     has an impact on the guest memory management and mapping
     which will lead to a different performance.

   * This, with the introduction of __s390_enable_skey_pmd and
     cond_resched, might increase the overhead in certain situations,
     but eventually improves the responsiveness over time,
     and hence avoids rcu stalls.

  [Other Info]
   
   * Since the patches are upstream in 5.19-rc1,
     they will be included in the kernel that is planned for kinetic (5.19).

   * Hence this is an SRU to jammy, impish and focal.

  __________

  ---Problem Description---
  There can be rcu stalls when running lots of large guests with storage keys:

  [1377614.579833] rcu: INFO: rcu_sched self-detected stall on CPU
  [1377614.579845] rcu:   18-....: (2099 ticks this GP) idle=54e/1/0x4000000000000002 softirq=35598716/35598716 fqs=998
  [1377614.579895]        (t=2100 jiffies g=155867385 q=20879)
  [1377614.579898] Task dump for CPU 18:
  [1377614.579899] CPU 1/KVM       R  running task        0 1030947 256019 0x06000004
  [1377614.579902] Call Trace:
  [1377614.579912] ([<0000001f1f4b4f52>] show_stack+0x7a/0xc0)
  [1377614.579918]  [<0000001f1ec8e96c>] sched_show_task.part.0+0xdc/0x100
  [1377614.579919]  [<0000001f1f4b7248>] rcu_dump_cpu_stacks+0xc0/0x100
  [1377614.579924]  [<0000001f1ecdd10c>] rcu_sched_clock_irq+0x75c/0x980
  [1377614.579926]  [<0000001f1eceb26c>] update_process_times+0x3c/0x80
  [1377614.579931]  [<0000001f1ecfcfea>] tick_sched_handle.isra.0+0x4a/0x70
  [1377614.579932]  [<0000001f1ecfd28e>] tick_sched_timer+0x5e/0xc0
  [1377614.579933]  [<0000001f1ecec294>] __hrtimer_run_queues+0x114/0x2f0
  [1377614.579935]  [<0000001f1ececfdc>] hrtimer_interrupt+0x12c/0x2a0
  [1377614.579938]  [<0000001f1ebecb6a>] do_IRQ+0xaa/0xb0
  [1377614.579942]  [<0000001f1f4c6d08>] ext_int_handler+0x130/0x134
  [1377614.579945]  [<0000001f1ec0af10>] ptep_zap_key+0x40/0x60

  Contact Information = cborn...@de.ibm.com

  ---uname output---
       RELEASE: 5.4.0-90-generic
       VERSION: #101-Ubuntu SMP Fri Oct 15 19:59:45 UTC 2021

  == Comment: #1 - Christian Borntraeger <cborn...@de.ibm.com> - 2022-05-24 03:59:37 ==
  This is a test patch that might address the rcu stalls.

  == Comment: #2 - Christian Borntraeger <cborn...@de.ibm.com> - 2022-05-24 04:00:22 ==
  This is a 2nd patch that reduces the cost of key setting.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1975582/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
