Re: A question about "3.2.2 Costs of Operations" in perf book

Paul E. McKenney Sat, 05 Apr 2025 19:29:17 -0700

On Sun, Apr 06, 2025 at 10:16:49AM +0800, Nan Xiao wrote:
> Hi Paul,
> 
> Thanks very much for your time and detailed explanation!
> 
> > ... So, yes, a CAS operation happens to map to the single x86 cmpxchg 
> > instruction, ...
> 
> Maybe a nitpick, but according to the book, a CAS operation is not mapped
> to the single x86 cmpxchg instruction, but to a single x86 cmpxchg
> instruction with the "lock" instruction prefix, right? Though I am not
> sure whether the semicolon here matters or not: "lock cmpxchg" or "lock;
> cmpxchg".


Very much a philosophical point.  However, other x86 instruction
prefixes are considered to be part of the following instruction, so I
feel comfortable treating "lock;" as a prefix.  ;-)
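
For what it is worth, here is a minimal sketch (an illustration, not taken
from the book) of a CAS written with C11 atomics.  The helper names
cas_int() and cas_demo() are made up for this example.  With GCC or Clang
targeting x86-64, atomic_compare_exchange_strong() on an int typically
compiles to a single cmpxchg carrying the lock prefix, whereas on 32-bit
ARM it typically becomes a load-linked/store-conditional loop.
Disassembling the object file for your own compiler and target is the way
to confirm.  As far as I know, the GNU assembler treats ";" as a statement
separator, so "lock; cmpxchg" and "lock cmpxchg" should assemble to the
same bytes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: atomically replace *p with desired, but only if
 * *p currently equals expected.  Returns true if the swap happened.
 */
static bool cas_int(atomic_int *p, int expected, int desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}

/* Hypothetical self-check: returns the final value of the variable. */
static int cas_demo(void)
{
    atomic_int v = 0;
    bool first = cas_int(&v, 0, 1);   /* succeeds: v was 0 */
    bool second = cas_int(&v, 0, 2);  /* fails: v is now 1, not 0 */

    assert(first && !second);
    return atomic_load(&v);
}
```

The architecture-independent source is the same everywhere; only the
instruction sequence the compiler emits differs from one CPU family to
the next.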

                                                        Thanx, Paul

> Thanks!
> 
> Best Regards
> Nan Xiao
> 
> On Sun, Apr 6, 2025 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Sat, Apr 05, 2025 at 07:07:17PM +0800, Nan Xiao wrote:
> > > Hello,
> > >
> > > Greetings from me!
> >
> > And good to e-meet you!
> >
> > > I am reading "3.2.2 Costs of Operations" in the perf book, and came
> > > across the following words:
> > >
> > > > The same-CPU compare-and-swap (CAS) operation consumes about seven
> > > > nanoseconds, a duration more than ten times that of the clock period.
> > > > ...... CAS functionality is provided by the lock; cmpxchg instruction on
> > > > x86.
> > > > ...... Similarly, the same-CPU lock operation (a “round trip” pair
> > > > consisting of a lock acquisition and release) consumes more than
> > > > fifteen nanoseconds, or more than thirty clock cycles. The lock
> > > > operation is more expensive than CAS because it requires two atomic
> > > > operations on the lock data structure,
> > >
> > > So my question is whether the "lock" operation in the above paragraph
> > > means the "lock" instruction. Because the CAS functionality is
> > > "lock; cmpxchg" on x86, a single "lock" instruction should consume
> > > less time than "lock; cmpxchg". Or have I misunderstood something?
> > > Thanks very much in advance!
> >
> > Good question!
> >
> > To see the answer, please keep in mind that although this book's
> > performance results are taken mostly from x86, the overall focus is
> > independent of architecture.  So, yes, a CAS operation happens to map to
> > the single x86 cmpxchg instruction, but on 32-bit ARM it would map to
> > a sequence of instructions featuring load-linked and store-conditional
> > instructions.
> >
> > With this in mind, the "lock" in Table 3.1 is not the x86 "lock"
> > instruction prefix, but rather the acquisition and release of a spinlock.
> > In the Linux kernel, that means spin_lock() and spin_unlock().  In
> > userspace, it means pthread_mutex_lock() and pthread_mutex_unlock().
> >
> > For support for this view in the text, please see the sentences reading
> > as follows:
> >
> >         Similarly, the same-CPU lock operation (a "round trip" pair
> >         consisting of a lock acquisition and release) consumes more
> >         than fifteen nanoseconds, or more than thirty clock cycles.
> >         The lock operation is more expensive than CAS because it
> >         requires two atomic operations on the lock data structure,
> >         one for acquisition and the other for release.
> >
> > Does that help?
> >
> >                                                         Thanx, Paul
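
To make the "two atomic operations" point concrete, here is a rough sketch
of a toy test-and-set spinlock in C11.  It is an illustration only: it is
neither the Linux kernel's spin_lock() implementation nor what
pthread_mutex_lock() does internally, and the names toy_lock(),
toy_unlock(), and run_locked() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy lock word: clear means unlocked, set means locked. */
static atomic_flag toy = ATOMIC_FLAG_INIT;
static long counter;  /* protected by the toy lock */

/* Acquisition: the first atomic operation on the lock word. */
static void toy_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&toy, memory_order_acquire))
        ;  /* spin until the current holder releases the lock */
}

/* Release: the second atomic operation on the lock word. */
static void toy_unlock(void)
{
    atomic_flag_clear_explicit(&toy, memory_order_release);
}

/* Perform n uncontended round trips; returns the final counter value. */
static long run_locked(long n)
{
    assert(n >= 0);
    counter = 0;
    for (long i = 0; i < n; i++) {
        toy_lock();    /* acquisition */
        counter++;     /* critical section */
        toy_unlock();  /* release */
    }
    return counter;
}
```

Each iteration of run_locked() is one "round trip" in the sense of the
quoted passage: one acquisition plus one release.  Timing this loop
against a loop of single CAS operations on the same CPU would be one way
to reproduce the comparison, though the numbers will vary with the
hardware.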
