On Fri, 29 May 2026 07:26:48 -0700
Wei Hu <[email protected]> wrote:

> From: Wei Hu <[email protected]>
> 
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
> 
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to reach a quiescent state using RCU,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
> 
> A per-device mutex serializes the reset path with ethdev
> operations. The mutex uses PTHREAD_PROCESS_SHARED for multi-process
> support and is held across blocking IB verbs calls. Operations that
> cannot wait (configure, queue setup) return -EBUSY during reset,
> while dev_stop and dev_close join the reset thread before acquiring
> the lock to ensure proper sequencing. A CAS-based helper prevents
> double-join of the reset thread.
> 
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active. RCU quiescent state tracking uses
> per-queue thread IDs in shared hugepage memory, covering both
> primary and secondary process data path threads.
> 
> The driver uses ethdev recovery events to notify upper layers
> (e.g. netvsc) of the reset lifecycle: RTE_ETH_EVENT_ERR_RECOVERING
> on entry, RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED on completion. A PCI device removal
> event callback distinguishes hot-remove from service reset.
> 
> Documentation for the device reset feature is added in the MANA
> NIC guide and the 26.07 release notes.
> 
> Signed-off-by: Wei Hu <[email protected]>
> ---

I went a bit deeper with AI and tried to figure out a good way to do
what the driver is trying to do without reinventing so much.

This reset logic is considerably more complex than other DPDK drivers,
and most of the complexity looks self-inflicted. A few specific things.

The RCU use is not really RCU. thread_online/offline are called on every
rx/tx burst, and the "thread" token is the queue index, not a thread. So
it is a per-queue in-use flag paid for on the hottest path, plus a new
library dependency, to express "wait until no queue is mid-burst" -- which
the driver already half does by swapping the burst function to
mana_*_burst_removed and checking dev_state. Please drop the rcu use. If
you must drain readers, a per-queue atomic flag is lighter and local; the
fast path already has the dev_state acquire-load it needs.
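
As a rough illustration of the per-queue flag idea, here is a minimal
sketch in plain C11 atomics. Every name in it (drain_dev, drain_queue,
drain_burst_enter, and so on) is hypothetical and not taken from the
patch, and it uses default sequentially consistent atomics for simplicity
rather than tuned orderings. A real driver would use the DPDK atomic
wrappers and pause or yield inside the wait loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these come from the patch. */
enum drain_state { DRAIN_DEV_ACTIVE = 0, DRAIN_DEV_RESETTING = 1 };

struct drain_queue {
	atomic_bool in_burst;	/* true while a burst runs on this queue */
};

struct drain_dev {
	atomic_int state;	/* holds an enum drain_state value */
	struct drain_queue q[4];
	unsigned int nb_queues;
};

/* Mark the queue busy, then re-check the state. Returns false if the
 * device is not active; in that case the flag is already cleared again. */
static bool drain_burst_enter(struct drain_dev *dev, struct drain_queue *q)
{
	atomic_store(&q->in_burst, true);
	if (atomic_load(&dev->state) != DRAIN_DEV_ACTIVE) {
		atomic_store(&q->in_burst, false);
		return false;
	}
	return true;
}

static void drain_burst_leave(struct drain_queue *q)
{
	atomic_store(&q->in_burst, false);
}

/* One enter and one leave with a single exit after the work, so no
 * return path can leave the flag set. */
static uint16_t drain_rx_burst(struct drain_dev *dev, unsigned int qid,
			       uint16_t nb_pkts)
{
	struct drain_queue *q = &dev->q[qid];
	uint16_t done;

	if (!drain_burst_enter(dev, q))
		return 0;
	done = nb_pkts;		/* placeholder for the real receive work */
	drain_burst_leave(q);
	return done;
}

/* Reset side: publish the new state first, then wait for every queue
 * to leave its current burst. */
static void drain_reset_wait(struct drain_dev *dev)
{
	atomic_store(&dev->state, DRAIN_DEV_RESETTING);
	for (unsigned int i = 0; i < dev->nb_queues; i++)
		while (atomic_load(&dev->q[i].in_burst))
			;	/* a real driver would pause or yield here */
}
```

The flag is local to the queue, so no thread registration is needed and
the wait in the reset path touches only the queues that exist.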

The data path only needs the atomic, and that part is fine. The lock is
legitimate for serializing the teardown/rebuild against control ops, but
the way it is used is the problem: it is acquired in mana_intr_handler,
released in mana_reset_enter, re-acquired in mana_reset_thread, and
released in mana_reset_exit_delay. That cross-function, cross-thread
handoff is exactly why every function needs __rte_no_thread_safety_analysis.
Acquire and release the lock in the same function and the annotations all
go away. Turning off thread-safety analysis needs strong justification and
this does not have it.
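
One possible shape for this, sketched with a plain pthread mutex and
hypothetical names (handoff_dev, handoff_reset_teardown,
handoff_reset_rebuild): each phase takes and drops the lock inside its own
function, and the wait for the hardware happens with the lock released.
This is an assumption about how the phases could be split, not a
description of the patch.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical device state; only what the sketch needs. */
struct handoff_dev {
	pthread_mutex_t ops_lock;
	bool torn_down;
	bool rebuilt;
	int last_error;
};

/* Enter phase: lock and unlock in the same function. */
static void handoff_reset_teardown(struct handoff_dev *dev)
{
	pthread_mutex_lock(&dev->ops_lock);
	dev->torn_down = true;	/* placeholder for stopping queues etc. */
	dev->rebuilt = false;
	pthread_mutex_unlock(&dev->ops_lock);
}

/* Exit phase: again lock and unlock in the same function.
 * Returns -1 if there is nothing to rebuild. */
static int handoff_reset_rebuild(struct handoff_dev *dev)
{
	int ret = 0;

	pthread_mutex_lock(&dev->ops_lock);
	if (!dev->torn_down) {
		ret = -1;
	} else {
		dev->rebuilt = true;	/* placeholder for re-probe, restart */
		dev->torn_down = false;
	}
	pthread_mutex_unlock(&dev->ops_lock);
	return ret;
}

/* Thread body: no lock is carried from one phase into the next. */
static void *handoff_reset_thread(void *arg)
{
	struct handoff_dev *dev = arg;

	handoff_reset_teardown(dev);
	/* wait for the hardware to recover here, with the lock released */
	dev->last_error = handoff_reset_rebuild(dev);
	return NULL;
}
```

With this layout no function returns while holding the lock, which is the
property the thread-safety analysis checks.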

Wrapping the ops in MANA_OPS_*_LOCK macros hides the lock/state protocol.
A single explicit helper at the top of each op is just as terse and stays
analyzable.
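
A sketch of what such a helper could look like, again with hypothetical
names (ops_dev, ops_check_not_resetting, ops_dev_configure) and a plain
pthread mutex standing in for the driver lock. The helper only checks
state; the lock and unlock stay visible in the op itself.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical device state; only what the sketch needs. */
struct ops_dev {
	pthread_mutex_t ops_lock;
	bool resetting;
	int configured;
};

/* Caller must hold ops_lock. Returns 0 if the op may proceed. */
static int ops_check_not_resetting(const struct ops_dev *dev)
{
	return dev->resetting ? -EBUSY : 0;
}

/* Example op: lock, check, work, unlock, all in one function. */
static int ops_dev_configure(struct ops_dev *dev)
{
	int ret;

	pthread_mutex_lock(&dev->ops_lock);
	ret = ops_check_not_resetting(dev);
	if (ret == 0)
		dev->configured++;	/* placeholder for the real work */
	pthread_mutex_unlock(&dev->ops_lock);
	return ret;
}
```

The cost over a macro is two extra lines per op, and the protocol can be
read without expanding anything.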

Two real bugs:

- Recovery and INTR_RMV events are sent via rte_eth_dev_callback_process
  while reset_ops_lock is held. An app handling INTR_RMV or RECOVERY_FAILED
  by calling dev_stop/dev_close will re-enter the lock (non-recursive) and
  deadlock; on the recovery path the callback runs on the reset thread so
  it also tries to join itself. Emit these events after dropping the lock,
  as ERR_RECOVERING already does.

- thread_online is taken at the top of the burst functions but
  thread_offline is only on some return paths. Any early return that
  misses it leaves a token non-quiescent and rte_rcu_qsbr_check() in
  mana_reset_enter spins forever. This is the kind of breakage the per-burst
  bracketing invites -- another reason to drop it.
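
For the first bug, a minimal sketch of the "decide under the lock, emit
after unlock" ordering. All names are hypothetical (evt_dev,
evt_finish_recovery, evt_app_callback), a plain function pointer stands in
for rte_eth_dev_callback_process, and the sketch covers only the lock
re-entry, not the self-join on the reset thread.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum evt_kind { EVT_RECOVERY_SUCCESS, EVT_RECOVERY_FAILED };

struct evt_dev {
	pthread_mutex_t ops_lock;	/* non-recursive, as in the review */
	bool recovering;
	bool stopped;
	void (*cb)(struct evt_dev *dev, enum evt_kind kind);
};

/* A control op that takes the lock, like dev_stop would. */
static int evt_dev_stop(struct evt_dev *dev)
{
	pthread_mutex_lock(&dev->ops_lock);
	dev->stopped = true;
	pthread_mutex_unlock(&dev->ops_lock);
	return 0;
}

/* Example application callback: stops the port when recovery fails. */
static void evt_app_callback(struct evt_dev *dev, enum evt_kind kind)
{
	if (kind == EVT_RECOVERY_FAILED)
		evt_dev_stop(dev);
}

/* Pick the event while holding the lock, deliver it after unlocking,
 * so a callback that re-enters a control op cannot deadlock. */
static void evt_finish_recovery(struct evt_dev *dev, bool hw_ok)
{
	enum evt_kind pending;

	pthread_mutex_lock(&dev->ops_lock);
	pending = hw_ok ? EVT_RECOVERY_SUCCESS : EVT_RECOVERY_FAILED;
	dev->recovering = false;
	pthread_mutex_unlock(&dev->ops_lock);

	if (dev->cb != NULL)
		dev->cb(dev, pending);
}
```

If the callback were invoked before the unlock, evt_dev_stop would block
forever on the lock its own caller holds.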

Also: reset_ops_lock is held across ibv_close_device and the PCI re-probe
(blocking under a sleeping mutex), and rte_alarm.h is included but no
longer used.

Please look at how hns3 or mlx5 structure reset/recovery. Matching the
common pattern means fixes can be made across drivers at once.
