On Thu, May 21, 2026 at 11:04:12AM -0700, Stephen Hemminger wrote:
> The goal is to land every deprecation currently listed in the release
> notes by the 26.11 ABI bump. Functions to be removed in 26.11 need to be
> marked __rte_deprecated by 26.07, with all in-tree users converted off
> them first so CI stays clean.
>
> This is the first step. After this series there are no remaining in-tree
> users of rte_atomic64. Expected follow-ups:
>
>   - convert remaining rte_atomic32 users (dpaa/fslmc, netvsc, vmbus,
>     sw_evdev, txgbe, ifc, hinic, bnx2x, vhost)
>   - convert remaining rte_atomic16 users (dpaa/fslmc, qman)
>   - mark the rte_atomicNN_*() family __rte_deprecated
>   - remove the legacy test_atomic.c
>   - remove the API itself at 26.11
>
> Patch 1 deletes the inline-asm atomic fallbacks across arm, ppc,
> loongarch, riscv, and x86 now that RTE_FORCE_INTRINSICS has been the
> default everywhere for years. Largest patch by line count and the one
> most worth review attention.
>
> Patch 2 retires the rte_smp_*mb deprecation notice (open since 2021) by
> reimplementing those APIs as inline wrappers over
> rte_atomic_thread_fence; the API is preserved for readability.
>
> Patch 3 is the load-bearing change for lib/: the last caller of
> rte_atomic32_cmpset() is converted, with explicit acquire/release
> orderings matching the existing HTS/RTS ring pattern.
>
> Driver conversions (patches 4-11) match each rte_atomic64 use to its best
> fit rather than blanket seq_cst: software stats become plain assignment
> (DPDK convention, torn reads accepted); CAS loops setting a flag collapse
> to fetch_or or exchange; open-coded link-status CAS in net/pfe and
> net/sfc moves to the existing rte_eth_linkstatus helpers; genuine
> synchronization stays atomic with explicit ordering.
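
To make the "CAS loops setting a flag collapse to fetch_or or exchange"
point concrete, here is a minimal sketch in plain C11 <stdatomic.h>
rather than the DPDK rte_stdatomic wrappers. FLAG_BUSY and both helper
functions are hypothetical names, not taken from any driver in the
series, and the memory orderings are illustrative only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit, not taken from any driver in the series. */
#define FLAG_BUSY UINT64_C(0x1)

/*
 * Legacy shape: retry a compare-and-swap until the bit is set.
 * Returns true if this caller was the one that set the bit.
 */
static bool set_busy_cas(_Atomic uint64_t *flags)
{
	uint64_t old = atomic_load_explicit(flags, memory_order_relaxed);

	do {
		if (old & FLAG_BUSY)
			return false; /* someone else already set it */
		/* on failure, 'old' is refreshed with the current value */
	} while (!atomic_compare_exchange_weak_explicit(flags, &old,
			old | FLAG_BUSY,
			memory_order_acq_rel, memory_order_relaxed));

	return true;
}

/*
 * Collapsed shape: one read-modify-write, no retry loop.
 * Same return value as set_busy_cas() for the same starting state.
 */
static bool set_busy_fetch_or(_Atomic uint64_t *flags)
{
	uint64_t old = atomic_fetch_or_explicit(flags, FLAG_BUSY,
			memory_order_acq_rel);

	return !(old & FLAG_BUSY);
}
```

Both helpers report whether the caller won the race to set the bit. The
fetch_or form has no retry loop, which is the simplification the cover
letter describes.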
>
> v2 - fix clang build
>    - replace rte_atomic64 in more drivers
>    - incorporate feedback on rte_smp and ring
>    - drop zxdh change (only caused by intrinsics in spinlock)
>
> Stephen Hemminger (11):
>   eal: use intrinsics for rte_atomic on all platforms
>   eal: reimplement rte_smp_*mb with rte_atomic_thread_fence
>   ring: use C11 atomic operations for MP/SP head/tail
>   net/bonding: use stdatomic
>   net/nbl: remove unused rte_atomic16 field
>   net/ena: replace use of rte_atomicNN
>   net/failsafe: convert to stdatomic
>   net/enic: do not use deprecated rte_atomic64
>   net/pfe: use ethdev linkstatus helpers
>   net/sfc: replace rte_atomic with stdatomic
>   crypto/ccp: replace use of rte_atomic64 with stdatomic
>

I decided to test this patchset with ring_perf_autotest (using only two
cores on the same socket) to see how performance on x86 may be affected
by this change. In an initial one-off test comparing performance with and
without this patchset for the MP/MC cases, it looks like smaller enq/deq
bursts, e.g. 8/32, are slower after this set, while larger bursts, e.g.
128/256, are slightly faster.

I then ran two more tests each with and without the patches applied, and
got AI to analyse the set of 6 results to come up with more meaningful
conclusions after a little more numeric analysis. Below is some of the
summary. While not necessarily a deal-breaker, the regressions seen are
cause for pause. We probably want to benchmark on a few other x86 systems
(both Intel and AMD) to see if this is a consistent picture.

/Bruce

---

Section-level picture (stable changes only):

  - Testing burst enq/deq: 10 consistent regressions, 0 consistent
    improvements.
  - Testing bulk enq/deq: 10 consistent regressions, 1 consistent
    improvement.
  - Testing using two physical cores: mixed, but regressions outnumber
    improvements (5 vs 2).
  - Zero-copy and compression sections: mostly inconclusive due to high
    variance, with one stable regression.
  - Empty bulk deq and single-element: mixed small set, with isolated
    improvements and regressions.

Largest consistent regressions (examples):

  elem MP/MC burst n=32:            -39.19%
  elem MP/MC bulk n=32:             -37.84%
  elem MP/MC two-core bulk n=8:     -36.58%
  elem MP/MC two-core bulk n=32:    -29.46%
  elem MP/MC burst n=8:             -29.43%
  legacy MP/MC burst n=8:           -19.04%

Largest consistent improvements (examples):

  elem MP/MC two-core bulk n=128:   +28.16%
  elem MP/MC two-core bulk n=256:   +25.91%
  legacy SP/SC empty bulk deq n=8:  +23.41%
  elem SP/SC bulk n=32:             +16.05%
  legacy MP/MC single:               +4.65%

Bottom line:

There is a real and mostly consistent regression trend in
cycles-per-element performance after replacing rte_atomicNN usage,
especially in the MP/MC burst and bulk paths. A few benchmarks improve,
but they are fewer than the regressions. All-worker total-count
throughput appears statistically flat in this 3-run sample.
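
For context on why the MP/MC cases are the sensitive ones: the
multi-producer enqueue path reserves slots with a compare-and-swap on the
producer head, so whatever ordering that loop uses sits directly on the
hot path. Below is a simplified sketch in plain C11, not the rte_ring
code. struct toy_ring and toy_reserve() are made up for illustration, and
the orderings shown are one plausible choice, not necessarily what
patch 3 uses.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified ring state; not the rte_ring layout. */
struct toy_ring {
	_Atomic uint32_t prod_head; /* next slot a producer may claim */
	_Atomic uint32_t cons_tail; /* slots before this are free again */
	uint32_t capacity;
};

/*
 * Reserve up to n slots for one producer (burst semantics).
 * Returns the number of slots reserved and stores the first reserved
 * index in *old_head. Returns 0 and leaves *old_head alone if full.
 */
static uint32_t toy_reserve(struct toy_ring *r, uint32_t n,
		uint32_t *old_head)
{
	uint32_t head = atomic_load_explicit(&r->prod_head,
			memory_order_relaxed);
	uint32_t take;

	do {
		/* acquire pairs with the consumer's release store of cons_tail */
		uint32_t tail = atomic_load_explicit(&r->cons_tail,
				memory_order_acquire);
		/* unsigned arithmetic keeps this correct across index wrap */
		uint32_t free_slots = r->capacity - (head - tail);

		take = n < free_slots ? n : free_slots;
		if (take == 0)
			return 0;
		/* on failure, 'head' is refreshed and the loop recomputes */
	} while (!atomic_compare_exchange_strong_explicit(&r->prod_head,
			&head, head + take,
			memory_order_relaxed, memory_order_relaxed));

	*old_head = head;
	return take;
}
```

With small bursts the fixed cost of this loop is spread over fewer
elements, so any change in the code generated for it would be expected to
show up more at n=8 or n=32 than at n=128 or n=256. That is only a guess
at the mechanism and would need confirming with a disassembly or perf
comparison.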

