Public bug reported: 1. Patch to kernel/irq/affinity.c: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux- next.git/commit/kernel/irq/affinity.c?h=next-20181108&id=b82592199032bf7c778f861b936287e37ebc9f62.
2. Patches to kernel/irq/matrix.c. There are three patches for this. The first two from Fujitsu, and then there is the patch from Long that actually makes the previous two work correctly. a. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/kernel/irq/matrix.c?h=next-20181108&id=8ffe4e61c06a48324cfd97f1199bb9838acce2f2 b. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/kernel/irq/matrix.c?h=next-20181108&id=76f99ae5b54d48430d1f0c5512a84da0ff9761e0 c. https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=irq/core&id=e8da8794a7fd9eef1ec9a07f0d4897c68581c72b We expect the tip patch to be applied to Linux-next soon. genirq/affinity: Spread IRQs to all available NUMA nodes If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts which are allocated for a device, the interrupt affinity spreading code fails to spread them across all nodes. The reason is, that the spreading code starts from node 0 and continues up to the number of interrupts requested for allocation. This leaves the nodes past the last interrupt unused. This results in interrupt concentration on the first nodes which violates the assumption of the block layer that all nodes are covered evenly. As a consequence the NUMA nodes above the number of interrupts are all assigned to hardware queue 0 and therefore NUMA node 0, which results in bad performance and has CPU hotplug implications, because queue 0 gets shut down when the last CPU of node 0 is offlined. Go over all NUMA nodes and assign them round-robin to all requested interrupts to solve this. irq/matrix: Split out the CPU selection code into a helper Linux finds the CPU which has the lowest vector allocation count to spread out the non managed interrupts across the possible target CPUs, but does not do so for managed interrupts. Split out the CPU selection code into a helper function for reuse. No functional change. irq/matrix: Spread managed interrupts on allocation Linux spreads out the non managed interrupt across the possible target CPUs to avoid vector space exhaustion. Managed interrupts are treated differently, as for them the vectors are reserved (with guarantee) when the interrupt descriptors are initialized. When the interrupt is requested a real vector is assigned. The assignment logic uses the first CPU in the affinity mask for assignment. If the interrupt has more than one CPU in the affinity mask, which happens when a multi queue device has less queues than CPUs, then doing the same search as for non managed interrupts makes sense as it puts the interrupt on the least interrupt plagued CPU. For single CPU affine vectors that's obviously a NOOP. Restructre the matrix allocation code so it does the 'best CPU' search, add the sanity check for an empty affinity mask and adapt the call site in the x86 vector management code. genirq/matrix: Improve target CPU selection for managed interrupts.irq/core On large systems with multiple devices of the same class (e.g. NVMe disks, using managed interrupts), the kernel can affinitize these interrupts to a small subset of CPUs instead of spreading them out evenly. irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask of possible target CPUs which has the lowest number of interrupt vectors allocated. This is done by searching the CPU with the highest number of available vectors. While this is correct for non-managed CPUs it can select the wrong CPU for managed interrupts. Under certain constellations this results in affinitizing the managed interrupts of several devices to a single CPU in a set. The book keeping of available vectors works the following way: 1) Non-managed interrupts: available is decremented when the interrupt is actually requested by the device driver and a vector is assigned. It's incremented when the interrupt and the vector are freed. 2) Managed interrupts: Managed interrupts guarantee vector reservation when the MSI/MSI-X functionality of a device is enabled, which is achieved by reserving vectors in the bitmaps of the possible target CPUs. This reservation decrements the available count on each possible target CPU. When the interrupt is requested by the device driver then a vector is allocated from the reserved region. The operation is reversed when the interrupt is freed by the device driver. Neither of these operations affect the available count. The reservation persist up to the point where the MSI/MSI-X functionality is disabled and only this operation increments the available count again. For non-managed interrupts the available count is the correct selection criterion because the guaranteed reservations need to be taken into account. Using the allocated counter could lead to a failing allocation in the following situation (total vector space of 10 assumed): CPU0 CPU1 available: 2 0 allocated: 5 3 <--- CPU1 is selected, but available space = 0 managed reserved: 3 7 while available yields the correct result. For managed interrupts the available count is not the appropriate selection criterion because as explained above the available count is not affected by the actual vector allocation. The following example illustrates that. Total vector space of 10 assumed. The starting point is: CPU0 CPU1 available: 5 4 allocated: 2 3 managed reserved: 3 3 Allocating vectors for three non-managed interrupts will result in affinitizing the first two to CPU0 and the third one to CPU1 because the available count is adjusted with each allocation: CPU0 CPU1 available: 5 4 <- Select CPU0 for 1st allocation --> allocated: 3 3 available: 4 4 <- Select CPU0 for 2nd allocation --> allocated: 4 3 available: 3 4 <- Select CPU1 for 3rd allocation --> allocated: 4 4 But the allocation of three managed interrupts starting from the same point will affinitize all of them to CPU0 because the available count is not affected by the allocation (see above). So the end result is: CPU0 CPU1 available: 5 4 allocated: 5 3 Introduce a "managed_allocated" field in struct cpumap to track the vector allocation for managed interrupts separately. Use this information to select the target CPU when a vector is allocated for a managed interrupt, which results in more evenly distributed vector assignments. The above example results in the following allocations: CPU0 CPU1 managed_allocated: 0 0 <- Select CPU0 for 1st allocation --> allocated: 3 3 managed_allocated: 1 0 <- Select CPU1 for 2nd allocation --> allocated: 3 4 managed_allocated: 1 1 <- Select CPU0 for 3rd allocation --> allocated: 4 4 The allocation of non-managed interrupts is not affected by this change and is still evaluating the available count. The overall distribution of interrupt vectors for both types of interrupts might still not be perfectly even depending on the number of non-managed and managed interrupts in a system, but due to the reservation guarantee for managed interrupts this cannot be avoided. Expose the new field in debugfs as well. ** Affects: linux-azure (Ubuntu) Importance: Undecided Status: Confirmed ** Changed in: linux-azure (Ubuntu) Status: New => Confirmed -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1802358 Title: [Hyper-V] Fix IRQ spreading on NVMe devices with lower numbers of channels To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802358/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs