Public bug reported:

Background:
When handling multifunction devices in zPCI we take the UID of the PCI function with function number 0 (which always exists according to the PCI spec) as the domain number. Therefore, when hot plugging functions with a function number larger than 0 before function 0, we need to hold these in standby until the domain and bus can be created. This has been tested during feature development using a patched QEMU and with DPM, but never in Classic Mode.

Reproduction:
This issue was introduced with the Topology aware PCI Enumeration code, so test with a Linux supporting that feature, e.g. Upstream, Devel Driver etc. On a Classic Mode machine with a multi-function device, hot plug ("Reassign I/O Path") only the FID of the second port to the LPAR.

Symptom:
After this, any additional hotplug, and even just deconfiguring a PCI device, will hang. A hotplug makes the entire Linux instance unresponsive.

Analysis:
The problem occurs in Classic Mode but did not show up in previous testing, because the LPAR hypervisor does hot plug/Reassign I/O Path as a two-step process:

1. zPCI event with PEC 0x0302 to plug the zPCI function in Standby
2. zPCI event with PEC 0x0301 to configure the zPCI function

For the first event we create the zdev in clp_add_pci_device() in Standby, which is all fine so far. The problem then occurs in step 2, as we find the existing zdev and try to configure it. This, however, does not work because the PCI bus has not been created yet (we still don't know the UID of function 0, which will become its domain).
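The ordering constraint behind this (no bus until the UID of function 0 is known) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the names model_fn, model_zbus and model_register() are hypothetical and are not the kernel's structures or functions; has_bus merely stands in for zdev->zbus->bus being non-NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_FN 8

/* Hypothetical stand-in for one PCI function of a multi-function device. */
struct model_fn {
	unsigned int uid;	/* UID reported for this function */
	bool present;		/* function has been plugged */
};

/* Hypothetical stand-in for the per-device bus bookkeeping. */
struct model_zbus {
	struct model_fn fn[MODEL_MAX_FN];
	bool has_bus;		/* models zbus->bus != NULL */
	unsigned int domain;	/* only meaningful once has_bus is true */
};

/*
 * Register a function with the model. The bus (and its domain number)
 * only comes into existence once function 0 is registered; functions
 * that arrive earlier are recorded but leave has_bus false.
 * Returns 0 on success, -1 on an invalid or duplicate function number.
 */
int model_register(struct model_zbus *zbus, unsigned int devfn,
		   unsigned int uid)
{
	assert(zbus != NULL);

	if (devfn >= MODEL_MAX_FN || zbus->fn[devfn].present)
		return -1;

	zbus->fn[devfn].uid = uid;
	zbus->fn[devfn].present = true;

	if (devfn == 0) {
		/* The UID of function 0 becomes the domain of the bus. */
		zbus->domain = uid;
		zbus->has_bus = true;
	}
	return 0;
}
```

In this model, registering function 1 first leaves has_bus false, which corresponds to the window in which the 0x0301 event arrives for a function whose bus does not exist yet.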
The bus pointer zdev->zbus->bus is thus still NULL but will be accessed by common code, which inevitably results in disaster, including the above-mentioned hang and (possibly) the RCU stall below:

[ 689.724703] rcu: INFO: rcu_sched self-detected stall on CPU
[ 689.724712] rcu: 16-....: (42004 ticks this GP) idle=6ee/1/0x4000000000000002 softirq=1234/1234 fqs=14001
[ 689.724742] (t=42006 jiffies g=89 q=3770)
[ 689.724743] Task dump for CPU 16:
[ 689.724745] task:kmcheck state:R running task stack: 0 pid: 205 ppid: 2 flags:0x00000004
[ 689.724747] Call Trace:
[ 689.724757] [<0000000ccde0b5c4>] show_stack+0x8c/0xd8
[ 689.724762] [<0000000ccd0dabc4>] sched_show_task.part.0+0xe4/0x110
[ 689.724764] [<0000000ccde0ea5e>] rcu_dump_cpu_stacks+0xde/0x120
[ 689.724767] [<0000000ccd1465c6>] print_cpu_stall+0x266/0x330
[ 689.724768] [<0000000ccd14a428>] rcu_sched_clock_irq+0x618/0x670
[ 689.724771] [<0000000ccd15cd7a>] update_process_times+0xba/0xf0
[ 689.724775] [<0000000ccd1766fa>] tick_sched_timer+0x9a/0x220
[ 689.724777] [<0000000ccd15d962>] __hrtimer_run_queues+0x182/0x3a0
[ 689.724779] [<0000000ccd1602f8>] hrtimer_interrupt+0x138/0x450
[ 689.724782] [<0000000ccd0451c0>] do_IRQ+0x90/0xa0
[ 689.724784] [<0000000ccde2be96>] ext_int_handler+0x17e/0x184
[ 689.724790] [<0000000ccd9f373e>] pci_get_slot+0x5e/0xa0
[ 689.724794] [<0000000ccd9dc182>] pci_scan_single_device+0x32/0x2a0
[ 689.724797] [<0000000ccd0868f2>] __zpci_event_availability+0x192/0x360
[ 689.724800] [<0000000ccdd40c16>] chsc_process_crw+0x2e6/0x300
[ 689.724802] [<0000000ccdd4b088>] crw_collect_info+0x2b8/0x320
[ 689.724804] [<0000000ccd0caf3a>] kthread+0x14a/0x170
[ 689.724805] [<0000000ccde2b814>] ret_from_fork+0x24/0x2c

The fix is very simple: we check zdev->zbus->bus for being NULL and, in that case, bail out of the 0x0301 case before calling the PCI common code pci_scan_single_device() with the NULL pointer.
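The shape of that check can be sketched in user space as follows. This is not the upstream patch: model_bus, model_zbus, model_zdev and model_configure() are hypothetical names, the scans counter merely stands in for the call to pci_scan_single_device(), and the enabled flag stands in for zpci_enable_device(), which this report says must still run even when the scan is skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct pci_bus / zpci_dev. */
struct model_bus {
	int scans;		/* counts simulated device scans */
};

struct model_zbus {
	struct model_bus *bus;	/* NULL until function 0 is known */
};

struct model_zdev {
	struct model_zbus *zbus;
	int enabled;		/* set once the function is enabled */
};

enum model_result {
	MODEL_SCANNED,		/* bus existed, device was scanned */
	MODEL_DEFERRED		/* no bus yet, scan skipped */
};

/* Models handling of the "configure" event (PEC 0x0301) for a known zdev. */
enum model_result model_configure(struct model_zdev *zdev)
{
	assert(zdev != NULL && zdev->zbus != NULL);

	/* Enabling happens regardless of whether the bus exists. */
	zdev->enabled = 1;

	/* The guard: never hand a NULL bus to the scan step. */
	if (zdev->zbus->bus == NULL)
		return MODEL_DEFERRED;

	zdev->zbus->bus->scans++;	/* stands in for the scan call */
	return MODEL_SCANNED;
}
```

With the guard in place, a configure event for a function whose bus is missing leaves the function enabled and returns early; the scan only runs once a bus has been attached.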
The only subtlety is that we still need to do the zpci_enable_device(), because the code in arch/s390/pci/pci_bus.c assumes that it can immediately do a scan of all devfn != 0 PCI functions once PCI function 0 is found. It thereby mimics what happens when we only find the FID for a function with devfn != 0 in the CLP List PCI Functions.

This is implemented in the following upstream commit:

0b2ca2c7d0c9e2731d01b6c862375d44a7e13923 s390/pci: fix hot-plug of PCI function missing bus

It is included in v5.10-rc3 and has been tagged for stable > v5.8, i.e. all upstream versions with the PCI enumeration changes. It also carries the appropriate Fixes tag.

I have verified that it cherry-picks cleanly on current focal master-next and expect it to cherry-pick cleanly on newer Ubuntu kernels too.

** Affects: ubuntu-z-systems
     Importance: Medium
         Status: New

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Skipper Bug Screeners (skipper-screen-team)
         Status: New

** Tags: architecture-s39064 bugnameltc-189163 severity-medium targetmilestone-inin2010

** Tags added: architecture-s39064 bugnameltc-189163 severity-medium targetmilestone-inin2010

** Changed in: ubuntu
     Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)

** Package changed: ubuntu => linux (Ubuntu)

https://bugs.launchpad.net/bugs/1903682

Title:
  [UBUNTU 20.10] NULL pointer dereference when configuring multi-function with devfn != 0 before devfn == 0