On 2025-11-02 16:44, Paul E. McKenney wrote:
Some arm64 platforms have slow per-CPU atomic operations, for example,
the Neoverse V2.  This commit therefore moves SRCU-fast from per-CPU
atomic operations to interrupt-disabled non-read-modify-write-atomic
atomic_read()/atomic_set() operations.  This works because
SRCU-fast-updown read-side primitives, unlike srcu_read_lock_fast()
and srcu_read_unlock_fast(), are not invoked from NMI handlers.  This
means that srcu_read_lock_fast_updown() and srcu_read_unlock_fast_updown()
can exclude themselves and each other simply by disabling interrupts.
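
For concreteness, a minimal sketch of an interrupt-disabled non-RMW
counter update (illustrative only, not the actual patch; the helper
name and the __percpu counter argument are made up):

#include <linux/atomic.h>
#include <linux/irqflags.h>
#include <linux/percpu.h>

/*
 * Bump a per-CPU SRCU-fast-updown counter without an atomic RMW.
 * Disabling interrupts excludes other updaters on this CPU, which
 * suffices because the _updown primitives are not called from NMI
 * handlers.
 */
static inline void srcu_fast_updown_ctr_inc(atomic_long_t __percpu *ctr)
{
        unsigned long flags;
        atomic_long_t *c;

        local_irq_save(flags);
        c = this_cpu_ptr(ctr);
        /* Plain read followed by plain write, not an atomic read-modify-write. */
        atomic_long_set(c, atomic_long_read(c) + 1);
        local_irq_restore(flags);
}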

This reduces the overhead of calls to srcu_read_lock_fast_updown() and
srcu_read_unlock_fast_updown() from about 100ns to about 12ns on an ARM
Neoverse V2.  Although this is not excellent compared to about 2ns on x86,
it sure beats 100ns.

This command was used to measure the overhead:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture refscale --allcpus \
        --duration 5 --configs NOPREEMPT \
        --kconfig "CONFIG_NR_CPUS=64 CONFIG_TASKS_TRACE_RCU=y" \
        --bootargs "refscale.loops=100000 refscale.guest_os_delay=5 refscale.nreaders=64 refscale.holdoff=30 torture.disable_onoff_at_boot refscale.scale_type=srcu-fast-updown refscale.verbose_batched=8 torture.verbose_sleep_frequency=8 torture.verbose_sleep_duration=8 refscale.nruns=100" \
        --trust-make

Hi Paul,

At a high level, what are you trying to achieve with this?

AFAIU, you are trying to remove the cost of atomics on per-CPU data
from the frequently called srcu-fast read lock/unlock fast paths when
CONFIG_NEED_SRCU_NMI_SAFE=y, am I on the right track?

[disclaimer: I've looked only briefly at your proposed patch.]
That said, there are various other, less specific approaches to consider
before introducing such an architecture- and use-case-specific workaround.

One example is the libside (user-level) RCU implementation, which uses
two counters per CPU [1].  One counter is for the rseq fast path, and
the second counter is for atomics (as a fallback).
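
The per-CPU layout in that scheme is roughly the following (a sketch
only; the field names are illustrative, see [1] for the actual layout):

#include <stdint.h>

struct percpu_rcu_counts {
        uintptr_t rseq_count;   /* incremented inside an rseq critical section */
        uintptr_t atomic_count; /* incremented with atomics when rseq aborts */
};

Readers that complete the rseq critical section touch only rseq_count;
readers that abort fall back to an atomic increment of atomic_count,
and the grace period sums both counters.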

If the typical scenario we want to optimize for is thread context, we
can probably remove the atomic from the fast path with just preempt off
by partitioning the per-cpu counters further, one possibility being:

struct percpu_srcu_fast_pair {
        unsigned long lock, unlock;     /* read-side entry and exit counts */
};

struct percpu_srcu_fast {
        struct percpu_srcu_fast_pair thread;   /* updated from thread context */
        struct percpu_srcu_fast_pair irq;      /* updated from interrupt context */
};

And the grace period sums both thread and irq counters.
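
A rough sketch of how the read side and the grace-period scan could use
the structures above (helper names assumed; memory ordering and the
unlock path omitted):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/preempt.h>

static inline void srcu_fast_lock_sketch(struct percpu_srcu_fast __percpu *pcp)
{
        struct percpu_srcu_fast *p;

        preempt_disable();
        p = this_cpu_ptr(pcp);
        if (in_interrupt())
                p->irq.lock++;          /* interrupt context uses its own counter */
        else
                p->thread.lock++;       /* thread context: plain increment, no atomics */
        preempt_enable();
}

static unsigned long srcu_fast_sum_locks_sketch(struct percpu_srcu_fast __percpu *pcp)
{
        unsigned long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu) {
                struct percpu_srcu_fast *p = per_cpu_ptr(pcp, cpu);

                sum += p->thread.lock + p->irq.lock;    /* sum both counter sets */
        }
        return sum;
}

Because interrupt-context readers always use the irq pair, the thread
counters never race with them, so neither path needs atomics.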

Thoughts?

Thanks,

Mathieu

[1] https://github.com/compudj/libside/blob/master/src/rcu.h#L71

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
