On Thu, 21 May 2026 23:59:41 -0700
Wei Hu <[email protected]> wrote:

> From: Wei Hu <[email protected]>
> 
> Add support for handling hardware service reset events in the
> MANA driver. When the MANA kernel driver receives a hardware
> service event, it initiates a device reset and notifies userspace
> via IBV_EVENT_DEVICE_FATAL. The MANA PMD handles this by
> performing an automatic teardown and recovery sequence.
> 
> The driver uses ethdev recovery events (ERR_RECOVERING,
> RECOVERY_SUCCESS, RECOVERY_FAILED) to notify upper layers of
> the reset lifecycle, and a PCI device removal event callback
> to distinguish hot-remove from service reset.
> 
> Changes since v2:
> - Fixed dev_state_qsv memory leak on device removal
> - Fixed reset thread TCB/stack leak: reset_thread_active is now
>   only cleared by the joiner, not the thread itself
> - Fixed second reset crash: removed reset thread join logic from
>   mana_dev_close (inner function) to avoid corrupting dev_state
>   when called from mana_reset_enter
> - Made reset_thread_active RTE_ATOMIC(bool) with explicit ordering
> - Added retry loop for rte_dev_event_callback_unregister on -EAGAIN
> - Initialized condvar/mutex with PTHREAD_PROCESS_SHARED since priv
>   is in hugepage shared memory
> - Added re-check of dev_state after lock acquisition in
>   mana_intr_handler to prevent racing with pci_remove_event_cb
> - Replaced (void *)0 with NULL in mp.c
> - Added lock ownership comment block at mana_reset_enter
> - Documented rte_dev_event_monitor_start() requirement
> - Added mana.rst documentation and release note (patch 2/2)
> 
> Changes since v1:
> - Removed net/netvsc patch from this series
> - Simplified reset exit: mana_reset_exit calls
>   mana_reset_exit_delay directly instead of spawning a thread
> - Added __rte_no_thread_safety_analysis annotations for clang
> - Switched to rte_thread_create_internal_control
> - Fixed declaration-after-statement style issues
> - Removed unnecessary blank lines and stale comments
> 
> Wei Hu (2):
>   net/mana: add device reset support
>   net/mana: add documentation for device reset support
> 
>  doc/guides/nics/mana.rst               |  33 +
>  doc/guides/rel_notes/release_26_07.rst |   8 +
>  drivers/net/mana/mana.c                | 995 ++++++++++++++++++++++---
>  drivers/net/mana/mana.h                |  33 +-
>  drivers/net/mana/meson.build           |   2 +-
>  drivers/net/mana/mp.c                  |  89 ++-
>  drivers/net/mana/mr.c                  |   6 +-
>  drivers/net/mana/rx.c                  |  24 +-
>  drivers/net/mana/tx.c                  |  40 +-
>  9 files changed, 1123 insertions(+), 107 deletions(-)
> 

AI review found a few things...


Thanks for the v3.  Most of the v2 items are addressed: dev_state_qsv
is freed via mana_dev_free_resources, reset_thread_active is now
RTE_ATOMIC(bool), the joiner owns the flag, pthread mutex/cond use
PROCESS_SHARED attrs, intr_handler re-checks state under the lock,
the lock hand-off has a clear comment, and a doc + release note
patch is included.

A few items remain or were introduced by the v2->v3 changes.

Errors

1. Deadlock in dev_close_lock + EAGAIN loop on PCI-remove callback.

   The v3 fix for the discarded unregister return uses a busy-wait:

       if (dev->device) {
               do {
                       ret = rte_dev_event_callback_unregister(...);
               } while (ret == -EAGAIN);
       }

   mana_intr_uninstall runs from mana_dev_close, which runs from
   mana_dev_close_lock with reset_ops_lock held.  EAL returns
   -EAGAIN while cb_lst->active is 1 (callback dispatching).  The
   callback in question is mana_pci_remove_event_cb, which itself
   blocks on rte_spinlock_lock(&priv->reset_ops_lock).  Result:
   dev_close holds the spinlock, the callback waits for the
   spinlock, the loop waits for the callback to finish -- three-way
   deadlock if a hot-remove event arrives during close.

   Either drop the lock before unregistering and re-take it after,
   or (better) use a sleeping mutex for reset_ops_lock so the
   callback can complete while close is suspended.

2. reset_ops_lock is still rte_spinlock_t held across blocking work.

   Unchanged from v2: the spinlock is held through

     - mana_reset_enter:      rte_rcu_qsbr_check spin,
                              mana_dev_stop,
                              mana_mp_req_on_rxtx (5s IPC timeout),
                              mana_dev_close
     - mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
                              mana_mp_req_on_rxtx, mana_dev_start

   Any other ethdev op that hits dev_ops while reset is in flight
   spins for tens of seconds.  Blocking calls under a spinlock are
   an Error; use a properly-initialised pthread_mutex_t (you
   already have PROCESS_SHARED attrs handy).

Warnings

3. Secondary process: qsbr does not actually quiesce secondary lcores.

   rte_rcu_qsbr_thread_register is only called from
   mana_dev_configure, which the secondary never runs.  The
   secondary's rx_burst/tx_burst still call thread_online/offline
   against the shared qsv, but the secondary tids are unregistered,
   so they are invisible to rte_rcu_qsbr_check in the primary, and
   the secondary MP handler (mana_mp_reset_enter) does not call
   qsbr_check at all -- it just sets db_page = NULL and munmaps.

   The dev_state check at the top of secondary tx_burst is racy:
   the page can be munmapped after the in-loop read of db_page but
   before the doorbell write at the bottom.  The "All secondary
   threads are quiescent" log line in mana_mp_reset_enter is not
   true.

   The secondary needs a real reader-side primitive -- its own
   qsbr with secondary lcore registration, or an rwlock the MP
   handler takes before munmap.

4. Double-join race between dev_close_lock and mana_reset_enter.

   dev_close_lock signals + joins the reset thread *before* taking
   reset_ops_lock.  intr_handler may fire in that window:

     1. dev_close_lock: pthread_cond_signal, rte_thread_join
     2. intr_handler:   take lock, see ACTIVE, call reset_enter
     3. reset_enter:    reset_thread_active still true (step 1
                        hasn't cleared it yet) -> rte_thread_join
                        on a thread that step 1 just joined -> UB

   The pre-lock signal/join was added to avoid deadlock with the
   reset thread holding the lock, which is fine in itself, but the
   reset_thread_active update needs to be inside that critical
   section, or reset_enter has to recheck under the lock.

5. mana_dev_close (non-_lock) does not join the reset thread.

   mana_dev_uninit -> mana_dev_close path doesn't go through
   dev_close_lock, so if pci_remove arrives while the reset thread
   is alive, mana_dev_free_resources can destroy reset_cond_mutex /
   reset_cond before the thread has exited.  In practice the
   pci_remove callback signals RESET_FAILED and the reset thread
   should exit quickly, but the ordering isn't enforced.  Either
   join in mana_dev_uninit, or document the assumption.

Info

6. mana_reset_exit_delay sets priv->ib_ctx = NULL even when
   ibv_close_device fails:

       ret = ibv_close_device(priv->ib_ctx);
       priv->ib_ctx = NULL;
       if (ret) { ... goto out; }

   The handle is leaked if close fails.  Either keep the old
   pointer on failure or accept that close-failure is unrecoverable
   and free it anyway with a comment.

Patch 2/2

7. The events list is a term/description pattern and should be a
   definition list per DPDK RST style:

       ``RTE_ETH_EVENT_ERR_RECOVERING``
          Reset has started.

       ``RTE_ETH_EVENT_RECOVERY_SUCCESS``
          Device has recovered successfully.

       ``RTE_ETH_EVENT_RECOVERY_FAILED``
          Recovery failed.

8. Convention is to fold doc + release note updates into the same
   commit as the feature.  Patches 1/2 and 2/2 can be squashed.

Reply via email to