Hi Paul,

Thanks very much for your time and detailed explanation!

> ... So, yes, a CAS operation happens to map to the single x86 cmpxchg 
> instruction, ...

Maybe a nitpick, but according to the book, a CAS operation maps not to
a bare x86 cmpxchg instruction, but to a single x86 cmpxchg
instruction with the "lock" instruction prefix, right? I am not sure,
though, whether the semicolon matters here: "lock cmpxchg" or "lock;
cmpxchg".
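
For concreteness, here is a minimal C11 sketch of the CAS operation in
question. It is an illustration only, not code from the book: the names
cas_int() and cas_demo() are hypothetical. The assumption is that a
mainstream compiler targeting x86 lowers
atomic_compare_exchange_strong() to a lock-prefixed cmpxchg, which is
worth confirming in the generated assembly for a given compiler and
set of flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: atomically replace *target with `desired`,
 * but only if *target currently equals `expected`.
 * Returns true if the swap happened.
 */
static bool cas_int(atomic_int *target, int expected, int desired)
{
    /* On failure, the C11 call writes the observed value into the
     * local copy of `expected`; this sketch simply discards it. */
    return atomic_compare_exchange_strong(target, &expected, desired);
}

/* Hypothetical demo: one successful and one failed CAS. */
static int cas_demo(void)
{
    atomic_int v;

    atomic_init(&v, 5);
    assert(cas_int(&v, 5, 7));   /* v was 5, so it becomes 7 */
    assert(!cas_int(&v, 5, 9));  /* v is 7, not 5, so nothing changes */
    return atomic_load(&v);
}
```

One way to inspect the mapping is to compile with "-O2 -S" and look for
cmpxchg in the output.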

Thanks!

Best Regards
Nan Xiao

On Sun, Apr 6, 2025 at 1:19 AM Paul E. McKenney wrote:
>
> On Sat, Apr 05, 2025 at 07:07:17PM +0800, Nan Xiao wrote:
> > Hello,
> >
> > Greetings from me!
>
> And good to e-meet you!
>
> > I am reading Section 3.2.2, "Costs of Operations", in the perf book,
> > and came across the following passage:
> >
> > > The same-CPU compare-and-swap (CAS) operation consumes about seven
> > > nanoseconds, a duration more than ten times that of the clock period.
> > > ... CAS functionality is provided by the lock; cmpxchg instruction on
> > > x86.
> > > ... Similarly, the same-CPU lock operation (a "round trip" pair
> > > consisting of a lock acquisition and release) consumes more than fifteen
> > > nanoseconds, or more than thirty clock cycles. The lock operation is more
> > > expensive than CAS because it requires two atomic operations on the lock
> > > data structure,
> >
> > So my question is: does the "lock" operation in the above paragraph
> > mean the x86 "lock" instruction? Because the CAS functionality is
> > "lock; cmpxchg" on x86, a single "lock" instruction should consume
> > less time than "lock; cmpxchg". Or have I misunderstood something?
> > Thanks very much in advance!
>
> Good question!
>
> To see the answer, please keep in mind that although this book's
> performance results are taken mostly from x86, the overall focus is
> independent of architecture.  So, yes, a CAS operation happens to map to
> the single x86 cmpxchg instruction, but on 32-bit ARM it would map to
> a sequence of instructions featuring load-linked and store-conditional
> instructions.
>
> With this in mind, the "lock" in Table 3.1 is not the x86 "lock"
> instruction prefix, but rather the acquisition and release of a spinlock.
> In the Linux kernel, spin_lock() and spin_unlock().  In userspace,
> pthread_mutex_lock() and pthread_mutex_unlock().
>
> For support for this view in the text, please see the sentences reading
> as follows:
>
>         Similarly, the same-CPU lock operation (a "round trip" pair
>         consisting of a lock acquisition and release) consumes more
>         than fifteen nanoseconds, or more than thirty clock cycles.
>         The lock operation is more expensive than CAS because it
>         requires two atomic operations on the lock data structure,
>         one for acquisition and the other for release.
>
> Does that help?
>
>                                                         Thanx, Paul
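
The "round trip" described in the quoted text can be sketched as a toy
spinlock in C11. This is only an illustration under stated assumptions:
toy_spinlock, toy_lock(), toy_unlock(), and toy_round_trip() are
hypothetical names, and the code is neither the Linux kernel's
spin_lock()/spin_unlock() nor a pthread mutex. It shows only that one
acquire/release pair touches the lock data structure with two atomic
operations.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical toy lock: a single atomic flag, set while held. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

static void toy_lock(toy_spinlock *l)
{
    /* Atomic operation 1 (acquisition): test-and-set returns the
     * previous value, so keep spinning while it was already set. */
    while (atomic_flag_test_and_set_explicit(&l->held,
                                             memory_order_acquire))
        ; /* spin */
}

static void toy_unlock(toy_spinlock *l)
{
    /* Atomic operation 2 (release): clear the flag. */
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical demo: n uncontended round trips on one thread. */
static int toy_round_trip(int n)
{
    toy_spinlock l = { ATOMIC_FLAG_INIT };
    int counter = 0;

    for (int i = 0; i < n; i++) {
        toy_lock(&l);
        counter++;          /* critical section */
        toy_unlock(&l);
    }
    /* The flag must be clear again after the final release. */
    assert(!atomic_flag_test_and_set(&l.held));
    return counter;
}
```

Production locks add fairness, backoff, and blocking on top of this,
so their instruction sequences will differ from this sketch.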
