On Sun, Apr 06, 2025 at 10:16:49AM +0800, Nan Xiao wrote:
> Hi Paul,
>
> Thanks very much for your time and detailed explanation!
>
> > ... So, yes, a CAS operation happens to map to the single x86 cmpxchg
> > instruction, ...
>
> Maybe a nitpick, but from the book, a CAS operation is not mapped to
> the single x86 cmpxchg instruction, but a single x86 cmpxchg
> instruction with "lock" instruction prefix, right? Though I am not
> sure whether semicolon here matters or not: "lock cmpxchg" or "lock;
> cmpxchg".

Very much a philosophical point. However, other x86 instruction
prefixes are considered to be part of the following instruction, so I
feel comfortable treating "lock;" as a prefix. ;-)

Thanx, Paul

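For readers following the "lock; cmpxchg" point, here is a minimal sketch of a
CAS written with C11 atomics. The helper name cas_int() is hypothetical and is
not taken from the book. Compilers targeting x86 typically turn the
atomic_compare_exchange_strong() call into a cmpxchg carrying the lock prefix,
though the exact code generated depends on the compiler and its options.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: atomically replace *p with newval, but only if
 * *p still holds oldval.  Returns true if the swap happened.
 */
bool cas_int(atomic_int *p, int oldval, int newval)
{
	assert(p != NULL);
	/*
	 * On failure, the C11 call stores the value it observed into
	 * oldval.  Here oldval is a by-value copy, so the caller never
	 * sees that update and only learns success or failure.
	 */
	return atomic_compare_exchange_strong(p, &oldval, newval);
}
```

A caller would write something like cas_int(&counter, 5, 7), which succeeds
only if counter still holds 5 at the instant of the operation.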
> Thanks!
>
> Best Regards
> Nan Xiao
>
> On Sun, Apr 6, 2025 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Sat, Apr 05, 2025 at 07:07:17PM +0800, Nan Xiao wrote:
> > > Hello,
> > >
> > > Greetings from me!
> >
> > And good to e-meet you!
> >
> > > I am reading "3.2.2 Costs of Operations" in the perf book, and came
> > > across the following words:
> > >
> > > > The same-CPU compare-and-swap (CAS) operation consumes about seven
> > > > nanoseconds, a duration more than ten times that of the clock period.
> > > > ......CAS functionality is provided by the lock; cmpxchg instruction on
> > > > x86.
> > > > ...... Similarly, the same-CPU lock operation (a “round trip” pair
> > > > consisting of a lock acquisition and release) consumes more than
> > > > fifteen nanoseconds, or more than thirty clock cycles. The lock
> > > > operation is more expensive than CAS because it requires two atomic
> > > > operations on the lock data structure,
> > >
> > > So my question is about the "lock" operation in the above paragraph:
> > > does it mean the "lock" instruction? Because the CAS functionality is
> > > "lock; cmpxchg" on x86, a single "lock" instruction should consume
> > > less time than "lock; cmpxchg". Or did I misunderstand something?
> > > Thanks very much in advance!
> >
> > Good question!
> >
> > To see the answer, please keep in mind that although this book's
> > performance results are taken mostly from x86, the overall focus is
> > independent of architecture. So, yes, a CAS operation happens to map to
> > the single x86 cmpxchg instruction, but on 32-bit ARM it would map to
> > a sequence of instructions featuring load-linked and store-conditional
> > instructions.
> >
> > With this in mind, the "lock" in Table 3.1 is not the x86 "lock"
> > instruction prefix, but rather the acquisition and release of a spinlock.
> > In the Linux kernel, that means spin_lock() and spin_unlock(); in
> > userspace, pthread_mutex_lock() and pthread_mutex_unlock().
> >
> > For support for this view in the text, please see the sentences reading
> > as follows:
> >
> > Similarly, the same-CPU lock operation (a "round trip" pair
> > consisting of a lock acquisition and release) consumes more
> > than fifteen nanoseconds, or more than thirty clock cycles.
> > The lock operation is more expensive than CAS because it
> > requires two atomic operations on the lock data structure,
> > one for acquisition and the other for release.
> >
> > Does that help?
> >
> > Thanx, Paul
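
To make the "two atomic operations" in the quoted passage concrete, here is a
toy spinlock sketched with C11 atomics. The names toy_spinlock,
toy_spin_lock() and toy_spin_unlock() are hypothetical. This is not how the
Linux kernel's spin_lock() or pthread_mutex_lock() is implemented, and it
makes no claim about which machine instructions a given compiler emits. It
only shows that a lock round trip touches the lock word atomically twice,
once to acquire and once to release, whereas a lone CAS touches its target
once.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical toy lock: 0 means free, 1 means held. */
struct toy_spinlock {
	atomic_int locked;
};

void toy_spin_lock(struct toy_spinlock *l)
{
	int expected = 0;

	/* Atomic operation 1: CAS the lock word from 0 to 1, retrying
	 * until it succeeds. */
	while (!atomic_compare_exchange_weak_explicit(&l->locked,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed))
		expected = 0;	/* a failed CAS overwrote expected */
}

void toy_spin_unlock(struct toy_spinlock *l)
{
	/* Sanity check: only a held lock may be released. */
	assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);

	/* Atomic operation 2: store 0 with release ordering. */
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A critical section is then bracketed by toy_spin_lock(&l) and
toy_spin_unlock(&l), and that pair is the "round trip" whose cost the book's
table reports for real lock implementations.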