Hi Paul,

Got it! Thanks again for your time and explanation!

Best Regards
Nan Xiao

On Sun, Apr 6, 2025 at 10:29 AM Paul E. McKenney <[email protected]> wrote:
>
> On Sun, Apr 06, 2025 at 10:16:49AM +0800, Nan Xiao wrote:
> > Hi Paul,
> >
> > Thanks very much for your time and detailed explanation!
> >
> > > ... So, yes, a CAS operation happens to map to the single x86 cmpxchg 
> > > instruction, ...
> >
> > Maybe a nitpick, but according to the book, a CAS operation maps not
> > to a bare x86 cmpxchg instruction, but to a single x86 cmpxchg
> > instruction with the "lock" instruction prefix, right? Though I am
> > not sure whether the semicolon matters here: "lock cmpxchg" or
> > "lock; cmpxchg".
>
> Very much a philosophical point.  However, other x86 instruction
> prefixes are considered to be part of the following instruction, so I
> feel comfortable treating "lock;" as a prefix.  ;-)
>
>                                                         Thanx, Paul
>
> > Thanks!
> >
> > Best Regards
> > Nan Xiao
> >
> > On Sun, Apr 6, 2025 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Sat, Apr 05, 2025 at 07:07:17PM +0800, Nan Xiao wrote:
> > > > Hello,
> > > >
> > > > Greetings from me!
> > >
> > > And good to e-meet you!
> > >
> > > > I am reading "3.2.2 Costs of Operations" in the perf book, and
> > > > came across the following words:
> > > >
> > > > > The same-CPU compare-and-swap (CAS) operation consumes about seven
> > > > > nanoseconds, a duration more than ten times that of the clock period.
> > > > > ... CAS functionality is provided by the lock; cmpxchg instruction
> > > > > on x86.
> > > > > ... Similarly, the same-CPU lock operation (a “round trip” pair
> > > > > consisting of a lock acquisition and release) consumes more than
> > > > > fifteen nanoseconds, or more than thirty clock cycles. The lock
> > > > > operation is more expensive than CAS because it requires two atomic
> > > > > operations on the lock data structure,
> > > >
> > > > So my question is about the "lock" operation in the above
> > > > paragraph: does it mean the "lock" instruction? Because the CAS
> > > > functionality is "lock; cmpxchg" on x86, a single "lock"
> > > > instruction should consume less time than "lock; cmpxchg". Or
> > > > have I misunderstood something? Thanks very much in advance!
> > >
> > > Good question!
> > >
> > > To see the answer, please keep in mind that although this book's
> > > performance results are taken mostly from x86, the overall focus is
> > > independent of architecture.  So, yes, a CAS operation happens to map to
> > > the single x86 cmpxchg instruction, but on 32-bit ARM it would map to
> > > a sequence of instructions featuring load-linked and store-conditional
> > > instructions.
> > >
> > > With this in mind, the "lock" in Table 3.1 is not the x86 "lock"
> > > instruction prefix, but rather the acquisition and release of a spinlock.
> > > In the Linux kernel, spin_lock() and spin_unlock().  In userspace,
> > > pthread_mutex_lock() and pthread_mutex_unlock().
> > >
> > > For support for this view in the text, please see the sentences reading
> > > as follows:
> > >
> > >         Similarly, the same-CPU lock operation (a "round trip" pair
> > >         consisting of a lock acquisition and release) consumes more
> > >         than fifteen nanoseconds, or more than thirty clock cycles.
> > >         The lock operation is more expensive than CAS because it
> > >         requires two atomic operations on the lock data structure,
> > >         one for acquisition and the other for release.
> > >
> > > Does that help?
> > >
> > >                                                         Thanx, Paul
