Hi Paul,

Thanks very much for your time and detailed explanation!

> ... So, yes, a CAS operation happens to map to the single x86 cmpxchg
> instruction, ...

Maybe a nitpick, but from the book, a CAS operation is not mapped to
the single x86 cmpxchg instruction, but a single x86 cmpxchg
instruction with "lock" instruction prefix, right? Though I am not sure
whether semicolon here matters or not: "lock cmpxchg" or
"lock; cmpxchg".

Thanks!

Best Regards
Nan Xiao

On Sun, Apr 6, 2025 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
>
> On Sat, Apr 05, 2025 at 07:07:17PM +0800, Nan Xiao wrote:
> > Hello,
> >
> > Greetings from me!
>
> And good to e-meet you!
>
> > I am reading "3.2.2 Costs of Operations" in perf book, and come
> > across following words:
> >
> > > The same-CPU compare-and-swap (CAS) operation consumes about seven
> > > nanoseconds, a duration more than ten times that of the clock period.
> > > ......CAS functionality is provided by the lock; cmpxchg instruction on
> > > x86.
> > > ...... Similarly, the same-CPU lock operation (a “round trip” pair
> > > consisting of a lock acquisition and release) consumes more than fifteen
> > > nanoseconds, or more than thirty clock cycles. The lock operation is more
> > > expensive than CAS because it requires two atomic operations on the lock
> > > data structure,
> >
> > So my question is for the "lock" operation in the above paragraph,
> > does it mean "lock" instruction? Because the CAS functionality is
> > "lock; cmpxchg" on x86, a single "lock" instruction should consume
> > less time than "lock; cmpxchg". Or I misunderstood something? Thanks
> > very much in advance!
>
> Good question!
>
> To see the answer, please keep in mind that although this book's
> performance results are taken mostly from x86, the overall focus is
> independent of architecture. So, yes, a CAS operation happens to map to
> the single x86 cmpxchg instruction, but on 32-bit ARM it would map to
> a sequence of instructions featuring load-linked and store-conditional
> instructions.
>
> With this in mind, the "lock" in Table 3.1 is not the x86 "lock"
> instruction prefix, but rather the acquisition and release of a spinlock.
> In the Linux kernel, spin_lock() and spin_unlock(). In userspace,
> pthread_mutex_lock() and pthread_mutex_unlock().
>
> For support for this view in the text, please see the sentences reading
> as follows:
>
>         Similarly, the same-CPU lock operation (a "round trip" pair
>         consisting of a lock acquisition and release) consumes more
>         than fifteen nanoseconds, or more than thirty clock cycles.
>         The lock operation is more expensive than CAS because it
>         requires two atomic operations on the lock data structure,
>         one for acquisition and the other for release.
>
> Does that help?
>
>                                                         Thanx, Paul
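
To make the distinction in the quoted explanation concrete, here is a
minimal C11 sketch contrasting a single CAS with a lock "round trip".
It is only an illustration under stated assumptions: toy_spinlock,
toy_lock(), toy_unlock(), single_cas() and round_trip_demo() are
hypothetical names invented for this example, and this is not how
spin_lock() or pthread_mutex_lock() is actually implemented. On x86,
compilers typically emit a lock-prefixed cmpxchg for the CAS; other
architectures use different instruction sequences.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy spinlock, for illustration only: 0 = free, 1 = held. */
typedef struct { atomic_int held; } toy_spinlock;

/* One CAS: a single atomic read-modify-write on *v. */
static bool single_cas(atomic_int *v, int old_val, int new_val)
{
    return atomic_compare_exchange_strong(v, &old_val, new_val);
}

/* Acquisition: first atomic operation on the lock data structure. */
static void toy_lock(toy_spinlock *l)
{
    int expected = 0;

    /* Retry CAS 0 -> 1 until it succeeds. */
    while (!atomic_compare_exchange_weak(&l->held, &expected, 1))
        expected = 0; /* a failed CAS stores the observed value here */
}

/* Release: second atomic operation on the lock data structure. */
static void toy_unlock(toy_spinlock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

/* One lock "round trip" (acquire, critical section, release). */
static int round_trip_demo(void)
{
    toy_spinlock l = { 0 };
    int counter = 0;

    toy_lock(&l);
    counter++;                          /* protected by the lock */
    assert(atomic_load(&l.held) == 1);  /* we hold the lock here */
    toy_unlock(&l);
    assert(atomic_load(&l.held) == 0);  /* lock is free again */
    return counter;
}
```

The point of the sketch is simply that single_cas() touches its
variable once, whereas a round trip through toy_lock() and toy_unlock()
touches the lock word twice, once for acquisition and once for release,
which is the reason the quoted text gives for the lock operation
costing more than a lone CAS.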
