On Thu, May 21, 2026 at 11:04:12AM -0700, Stephen Hemminger wrote:
> The goal is to land every deprecation currently listed in the release
> notes by the 26.11 ABI bump. Functions to be removed in 26.11 need to be
> marked __rte_deprecated by 26.07, with all in-tree users converted off
> them first so CI stays clean.
> 
> This is the first step. After this series there are no remaining in-tree
> users of rte_atomic64. Expected follow-ups:
> 
>   - convert remaining rte_atomic32 users (dpaa/fslmc, netvsc, vmbus,
>     sw_evdev, txgbe, ifc, hinic, bnx2x, vhost)
>   - convert remaining rte_atomic16 users (dpaa/fslmc, qman)
>   - mark the rte_atomicNN_*() family __rte_deprecated
>   - remove the legacy test_atomic.c
>   - remove the API itself at 26.11
> 
> Patch 1 deletes the inline-asm atomic fallbacks across arm, ppc,
> loongarch, riscv, and x86 now that RTE_FORCE_INTRINSICS has been the
> default everywhere for years. Largest patch by line count and the one
> most worth review attention.
> 
> Patch 2 retires the rte_smp_*mb deprecation notice (open since 2021) by
> reimplementing those APIs as inline wrappers over
> rte_atomic_thread_fence; the API is preserved for readability.
> 
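
For anyone following along, the wrapper approach described for patch 2 can
be sketched in plain C11 roughly as below. This is only an illustration,
not the code from the patch: the *_sketch names are made up, standard
atomic_thread_fence() stands in for rte_atomic_thread_fence(), and the
mapping of full/write/read barriers to seq_cst/release/acquire is my
assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for rte_smp_mb/wmb/rmb; the orderings chosen
 * here are an assumption, not taken from the patch. */
static inline void smp_mb_sketch(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

static inline void smp_wmb_sketch(void)
{
	atomic_thread_fence(memory_order_release);
}

static inline void smp_rmb_sketch(void)
{
	atomic_thread_fence(memory_order_acquire);
}

static int payload;        /* plain data handed from writer to reader */
static atomic_bool ready;  /* flag announcing that payload is valid */

/* Writer: store the data, fence, then raise the flag. */
static void publish_sketch(int v)
{
	payload = v;
	smp_wmb_sketch();
	atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: check the flag, fence, then read the data. */
static bool try_consume_sketch(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return false;
	smp_rmb_sketch();
	*out = payload;
	return true;
}
```

The call sites keep the familiar barrier names, which I take to be the
readability point made above.
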
> Patch 3 is the load-bearing change for lib/: the last caller of
> rte_atomic32_cmpset() is converted, with explicit acquire/release
> orderings matching the existing HTS/RTS ring pattern.
> 
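
As a rough illustration of what "explicit acquire/release orderings" can
look like on a head/tail pair, here is a much simplified producer-side
sketch in plain C11. The struct, the function names and the exact memory
orders are assumptions on my part for illustration only; this is not the
rte_ring code and not what the patch necessarily does.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, heavily simplified ring bookkeeping. */
struct ring_sketch {
	_Atomic uint32_t prod_head;
	_Atomic uint32_t prod_tail;
	_Atomic uint32_t cons_tail;
	uint32_t capacity;
};

/* Try to reserve n slots. Returns n and the old head on success,
 * 0 if there is not enough free space. */
static uint32_t
ring_reserve_sketch(struct ring_sketch *r, uint32_t n, uint32_t *old_head)
{
	uint32_t head = atomic_load_explicit(&r->prod_head,
					     memory_order_relaxed);

	for (;;) {
		/* Acquire pairs with the consumer's release of its tail. */
		uint32_t cons = atomic_load_explicit(&r->cons_tail,
						     memory_order_acquire);
		uint32_t free_slots = r->capacity - (head - cons);

		if (n > free_slots)
			return 0;

		/* On failure 'head' is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&r->prod_head,
				&head, head + n,
				memory_order_acquire, memory_order_relaxed)) {
			*old_head = head;
			return n;
		}
	}
}

/* After filling the slots, wait for earlier producers and then
 * publish with release so consumers see the written objects. */
static void
ring_publish_sketch(struct ring_sketch *r, uint32_t old_head, uint32_t n)
{
	while (atomic_load_explicit(&r->prod_tail,
				    memory_order_relaxed) != old_head)
		;
	atomic_store_explicit(&r->prod_tail, old_head + n,
			      memory_order_release);
}
```
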
> Driver conversions (patches 4-11) match each rte_atomic64 use to its best
> fit rather than blanket seq_cst: software stats become plain assignment
> (DPDK convention, torn reads accepted); CAS loops setting a flag collapse
> to fetch_or or exchange; open-coded link-status CAS in net/pfe and
> net/sfc moves to the existing rte_eth_linkstatus helpers; genuine
> synchronization stays atomic with explicit ordering.
> 
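
To make the "CAS loops setting a flag collapse to fetch_or" point
concrete, here is a small before/after sketch in plain C11. The flag
value, the function names and the acq_rel ordering are hypothetical and
not taken from any of the drivers in the series.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_SKETCH UINT64_C(0x1)	/* hypothetical flag bit */

/* Before: open-coded CAS loop. Returns true if this caller set the flag. */
static bool set_flag_cas_loop(_Atomic uint64_t *word)
{
	uint64_t old = atomic_load_explicit(word, memory_order_relaxed);

	do {
		if (old & FLAG_SKETCH)
			return false;	/* someone else already set it */
	} while (!atomic_compare_exchange_weak_explicit(word, &old,
			old | FLAG_SKETCH,
			memory_order_acq_rel, memory_order_relaxed));
	return true;
}

/* After: one fetch_or. The previous value says who set the flag first. */
static bool set_flag_fetch_or(_Atomic uint64_t *word)
{
	uint64_t old = atomic_fetch_or_explicit(word, FLAG_SKETCH,
						memory_order_acq_rel);

	return !(old & FLAG_SKETCH);
}
```

Both versions return the same answer and leave the word in the same
state; the second is a single read-modify-write with no retry loop.
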
> v2
>   - fix clang build
>   - replace rte_atomic64 in more drivers
>   - incorporate feedback on rte_smp and ring
>   - drop zxdh change (only caused by intrinsics in spinlock)
> 
> Stephen Hemminger (11):
>   eal: use intrinsics for rte_atomic on all platforms
>   eal: reimplement rte_smp_*mb with rte_atomic_thread_fence
>   ring: use C11 atomic operations for MP/SP head/tail
>   net/bonding: use stdatomic
>   net/nbl: remove unused rte_atomic16 field
>   net/ena: replace use of rte_atomicNN
>   net/failsafe: convert to stdatomic
>   net/enic: do not use deprecated rte_atomic64
>   net/pfe: use ethdev linkstatus helpers
>   net/sfc: replace rte_atomic with stdatomic
>   crypto/ccp: replace use of rte_atomic64 with stdatomic
> 
I decided to test this patchset with ring_perf_autotest (using only two
cores on the same socket) to see how performance on x86 may be affected by
this change. In an initial one-off test comparing performance with and
without this patchset for the MP/MC cases, it looks like smaller enq/deq
bursts, e.g. 8/32, are slower after this set, while larger bursts, e.g.
128/256, are slightly faster.

I then ran two more tests with the patches applied and two more without,
and got AI to analyse the set of 6 results, with a little more numeric
analysis, to come up with more meaningful conclusions. Below is some of
the summary.

While not necessarily a deal-breaker, the regressions seen are cause for
pause. We probably want to benchmark on a few other x86 (both Intel and
AMD) systems to see if this is a consistent picture.

/Bruce

---

Section-level picture (stable changes only):

Testing burst enq/deq:
  10 consistent regressions, 0 consistent improvements.
Testing bulk enq/deq:
  10 consistent regressions, 1 consistent improvement.
Testing using two physical cores:
  mixed, but regressions outnumber improvements (5 vs 2).
Zero-copy and compression sections:
  mostly inconclusive due to high variance, with one stable regression.
Empty bulk deq and single-element:
  mixed small set, with isolated improvements and regressions.

Largest consistent regressions (examples):

  elem MP/MC burst n=32: -39.19%
  elem MP/MC bulk n=32: -37.84%
  elem MP/MC two-core bulk n=8: -36.58%
  elem MP/MC two-core bulk n=32: -29.46%
  elem MP/MC burst n=8: -29.43%
  legacy MP/MC burst n=8: -19.04%


Largest consistent improvements (examples):

  elem MP/MC two-core bulk n=128: +28.16%
  elem MP/MC two-core bulk n=256: +25.91%
  legacy SP/SC empty bulk deq n=8: +23.41%
  elem SP/SC bulk n=32: +16.05%
  legacy MP/MC single: +4.65%


Bottom line:

There is a real and mostly consistent regression trend in cycles-per-element
performance after replacing rte_atomicNN usage, especially in MP/MC burst
and bulk paths. A few benchmarks improve, but they are outnumbered by the
regressions. All-worker total-count throughput appears statistically flat
in this 3-run sample.
