Eric Dumazet <eric.duma...@gmail.com> writes:
> CAUTION: This email originated from outside of the organization. Do not click > links or open attachments unless you can confirm the sender and know the > content is safe. > > > > On 7/12/20 3:36 PM, akiy...@amazon.com wrote: >> From: Arthur Kiyanovski <akiy...@amazon.com> >> >> In napi busy-poll mode, the kernel invokes the napi handler of the >> device repeatedly to poll the NIC's receive queues. This process >> repeats until a timeout, specific for each connection, is up. >> By polling packets in busy-poll mode the user may gain lower latency >> and higher throughput (since the kernel no longer waits for interrupts >> to poll the queues) in expense of CPU usage. >> >> Upon completing a napi routine, the driver checks whether >> the routine was called by an interrupt handler. If so, the driver >> re-enables interrupts for the device. This is needed since an >> interrupt routine invocation disables future invocations until >> explicitly re-enabled. >> >> The driver avoids re-enabling the interrupts if they were not disabled >> in the first place (e.g. if driver in busy mode). >> Originally, the driver checked whether interrupt re-enabling is needed >> by reading the 'ena_napi->unmask_interrupt' variable. This atomic >> variable was set upon interrupt and cleared after re-enabling it. >> >> In the 4.10 Linux version, the 'napi_complete_done' call was changed >> so that it returns 'false' when device should not re-enable >> interrupts, and 'true' otherwise. The change includes reading the >> "NAPIF_STATE_IN_BUSY_POLL" flag to check if the napi call is in >> busy-poll mode, and if so, return 'false'. >> The driver was changed to re-enable interrupts according to this >> routine's return value. >> The Linux community rejected the use of the >> 'ena_napi->unmaunmask_interrupt' variable to determine whether >> unmasking is needed, and urged to use napi_napi_complete_done() >> return value solely. >> See https://lore.kernel.org/patchwork/patch/741149/ for more details > > Yeah, and I see you did not bother to CC me on this new submission. > I apologize for this. We should have been more careful. We'll take better notice next time >> >> As explained, a busy-poll session exists for a specified timeout >> value, after which it exits the busy-poll mode and re-enters it later. >> This leads to many invocations of the napi handler where >> napi_complete_done() false indicates that interrupts should be >> re-enabled. >> This creates a bug in which the interrupts are re-enabled >> unnecessarily. >> To reproduce this bug: >> 1) echo 50 | sudo tee /proc/sys/net/core/busy_poll >> 2) echo 50 | sudo tee /proc/sys/net/core/busy_read >> 3) Add counters that check whether >> 'ena_unmask_interrupt(tx_ring, rx_ring);' >> is called without disabling the interrupts in the first >> place (i.e. with calling the interrupt routine >> ena_intr_msix_io()) >> >> Steps 1+2 enable busy-poll as the default mode for new connections. >> >> The busy poll routine rearms the interrupts after every session by >> design, and so we need to add an extra check that the interrupts were >> masked in the first place. >> >> Signed-off-by: Shay Agroskin <shay...@amazon.com> >> Signed-off-by: Arthur Kiyanovski <akiy...@amazon.com> >> --- >> drivers/net/ethernet/amazon/ena/ena_netdev.c | 7 ++++++- >> drivers/net/ethernet/amazon/ena/ena_netdev.h | 1 + >> 2 files changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c >> b/drivers/net/ethernet/amazon/ena/ena_netdev.c >> index 91be3ffa1c5c..90c0fe15cd23 100644 >> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c >> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c >> @@ -1913,7 +1913,9 @@ static int ena_io_poll(struct napi_struct *napi, int >> budget) >> /* Update numa and unmask the interrupt only when schedule >> * from the interrupt context (vs from sk_busy_loop) >> */ >> - if (napi_complete_done(napi, rx_work_done)) { >> + if (napi_complete_done(napi, rx_work_done) && >> + READ_ONCE(ena_napi->interrupts_masked)) { >> + WRITE_ONCE(ena_napi->interrupts_masked, false); >> /* We apply adaptive moderation on Rx path only. >> * Tx uses static interrupt moderation. >> */ >> @@ -1961,6 +1963,9 @@ static irqreturn_t ena_intr_msix_io(int irq, void >> *data) >> >> ena_napi->first_interrupt = true; >> >> + WRITE_ONCE(ena_napi->interrupts_masked, true); >> + smp_wmb(); /* write interrupts_masked before calling napi */ > > It is not clear where is the paired smp_wmb() > Can you please explain what you mean ? The idea of adding the store barrier here is to ensure that the WRITE_ONCE(…) invocation is executed before invoking the napi soft irq. From what I gathered using this command would result in compiler barrier (which would prevent it from executing the bool store after napi scheduling) on x86 and a memory barrier on ARM64 machines which have a weaker consistency model. >> + >> napi_schedule_irqoff(&ena_napi->napi); >> >> return IRQ_HANDLED; >> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h >> b/drivers/net/ethernet/amazon/ena/ena_netdev.h >> index ba030d260940..89304b403995 100644 >> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h >> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h >> @@ -167,6 +167,7 @@ struct ena_napi { >> struct ena_ring *rx_ring; >> struct ena_ring *xdp_ring; >> bool first_interrupt; >> + bool interrupts_masked; >> u32 qid; >> struct dim dim; >> }; >> > > Note that writing/reading over bool fields from hard irq context without > proper sync is not generally allowed. > > Compiler could perform RMW over plain 32bit words. Doesn't the READ/WRITE_ONCE macros prevent the compiler from reordering these instructions ? Also from what I researched (please correct me if I'm wrong here) both x86 and ARM don't allow reordering LOAD and STORE when they access the same variable (register or memory address). > > Sometimes we accept the races, but in this context this might be bad. As long a the writing of the flag is atomic in the sense that the value in memory isn't some hybrid value of two parallel stores, the existing race cannot result in a dead lock or leaving the interrupt vector masked. Am I missing something ? Thank you for taking the time to look at this patch. Shay