In mana driver, the number of IRQs allocated are capped by the
min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
than the vcpu count, we want to utilize all the vcpus, irrespective of
their NUMA/core bindings.

This is important, especially in the envs where number of vcpus are so
few that the softIRQ handling overhead on two IRQs on the same vcpu is
much more than their overheads if they were spread across sibling vcpus

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU.

In such cases when many parallel TCP connections are tested, the
throughput drops significantly

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         2
IRQ3:   mana_q3         0
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         38.85   0.03    24.89   24.65
pass 2:         39.15   0.03    24.57   25.28
pass 3:         40.36   0.03    23.20   23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42   15.85   14.99   14.51
pass 2:         15.53   15.94   15.81   15.93
pass 3:         16.41   16.35   16.40   16.36

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn        with patch      w/o patch
20480           15.65           7.73
10240           15.63           8.93
8192            15.64           9.69
6144            15.64           13.16
4096            15.69           15.75
2048            15.69           15.83
1024            15.71           15.28

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: [email protected]
Signed-off-by: Shradha Gupta <[email protected]>
Signed-off-by: Erni Sri Satya Vennela <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c 
b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..433c044d53c6 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int 
len, int node,
        return 0;
 }
 
+static int irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+       int cpu;
+
+       rcu_read_lock();
+       for_each_online_cpu(cpu) {
+               if (len <= 0)
+                       break;
+
+               irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+               len--;
+       }
+       rcu_read_unlock();
+
+       return 0;
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
        struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, 
int nvec)
         * first CPU sibling group since they are already affinitized to HWC IRQ
         */
        cpus_read_lock();
-       if (gc->num_msix_usable <= num_online_cpus())
+       if (gc->num_msix_usable <= num_online_cpus()) {
                skip_first_cpu = true;
+               err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
+       } else {
+               /*
+                * In case our IRQs are more than num_online_cpus, we try to
+                * make sure we are using all vcpus. In such a case NUMA or
+                * CPU core affinity does not matter.
+                * Note that in this case the total mana IRQ should always be
+                * num_online_cpu + 1. The first HWC IRQ is already handled
+                * in HWC setup calls
+                * So, the nvec value in this path should always be equal to
+                * num_online_cpu
+                */
+               WARN_ON(nvec > num_online_cpus());
+               err = irq_setup_linear(irqs, nvec);
+       }
 
-       err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
        if (err) {
                cpus_read_unlock();
                goto free_irq;

base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
-- 
2.34.1


Reply via email to