On Mon, Mar 16, 2026 at 05:40:50PM +0000, Dmitry Ilvokhin wrote: [...]
> A possible generic solution is a trace_contended_release() for spin
> locks, for example:
>
>     if (trace_contended_release_enabled() &&
>         atomic_read(&lock->val) & ~_Q_LOCKED_MASK)
>         trace_contended_release(lock);
>
> This might work on x86, but could increase code size and regress
> performance on arches where spin_unlock() is inlined, such as arm64
> under !PREEMPTION.

I took a stab at this idea and submitted an RFC [1]. The
implementation builds on the earlier observation from Matthew that
_raw_spin_unlock() is not inlined in most configurations. In those
cases, when the tracepoint is disabled, this adds a single NOP on the
fast path, with the conditional check staying out of line. The
measured text size increase in this configuration is +983 bytes.

For configurations where _raw_spin_unlock() is inlined, the
instrumentation does increase code size more noticeably (+71 KB in my
measurements), since the check and the out-of-line call are replicated
at each call site.

This provides a generic release-side signal for contended locks,
allowing correlation of lock holders with waiters and measurement of
contended hold times. The RFC addresses the same visibility gap
without introducing per-lock instrumentation. If this tradeoff is
acceptable, it could be a generic alternative to lock-specific
tracepoints.

[1]: https://lore.kernel.org/all/51aad0415b78c5a39f2029722118fa01eac77538.1773858853.gi...@ilvokhin.com
