Hi Paul,

Got it! Thanks again for your time and explanation!

Best Regards
Nan Xiao

On Sun, Apr 6, 2025 at 10:29 AM Paul E. McKenney <[email protected]> wrote:
>
> On Sun, Apr 06, 2025 at 10:16:49AM +0800, Nan Xiao wrote:
> > Hi Paul,
> >
> > Thanks very much for your time and detailed explanation!
> >
> > > ... So, yes, a CAS operation happens to map to the single x86 cmpxchg
> > > instruction, ...
> >
> > Maybe a nitpick, but from the book, a CAS operation is not mapped to
> > the single x86 cmpxchg instruction, but a single x86 cmpxchg
> > instruction with "lock" instruction prefix, right? Though I am not
> > sure whether semicolon here matters or not: "lock cmpxchg" or "lock;
> > cmpxchg".
>
> Very much a philosophical point. However, other x86 instruction
> prefixes are considered to be part of the following instruction, so I
> feel comfortable treating "lock;" as a prefix. ;-)
>
>                                                 Thanx, Paul
>
> > Thanks!
> >
> > Best Regards
> > Nan Xiao
> >
> > On Sun, Apr 6, 2025 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Sat, Apr 05, 2025 at 07:07:17PM +0800, Nan Xiao wrote:
> > > > Hello,
> > > >
> > > > Greetings from me!
> > >
> > > And good to e-meet you!
> > >
> > > > I am reading "3.2.2 Costs of Operations" in perf book, and come
> > > > across following words:
> > > >
> > > > > The same-CPU compare-and-swap (CAS) operation consumes about seven
> > > > > nanoseconds, a duration more than ten times that of the clock period.
> > > > > ......CAS functionality is provided by the lock; cmpxchg instruction
> > > > > on x86.
> > > > > ...... Similarly, the same-CPU lock operation (a “round trip” pair
> > > > > consisting of a lock acquisition and release) consumes more than
> > > > > fifteen nanoseconds, or more than thirty clock cycles. The lock
> > > > > operation is more expensive than CAS because it requires two atomic
> > > > > operations on the lock data structure,
> > > >
> > > > So my question is for the "lock" operation in the above paragraph,
> > > > does it mean "lock" instruction? Because the CAS functionality is
> > > > "lock; cmpxchg" on x86, a single "lock" instruction should consume
> > > > less time than "lock; cmpxchg". Or I misunderstood something? Thanks
> > > > very much in advance!
> > >
> > > Good question!
> > >
> > > To see the answer, please keep in mind that although this book's
> > > performance results are taken mostly from x86, the overall focus is
> > > independent of architecture. So, yes, a CAS operation happens to map to
> > > the single x86 cmpxchg instruction, but on 32-bit ARM it would map to
> > > a sequence of instructions featuring load-linked and store-conditional
> > > instructions.
> > >
> > > With this in mind, the "lock" in Table 3.1 is not the x86 "lock"
> > > instruction prefix, but rather the acquisition and release of a spinlock.
> > > In the Linux kernel, spin_lock() and spin_unlock(). In userspace,
> > > pthread_mutex_lock() and pthread_mutex_unlock().
> > >
> > > For support for this view in the text, please see the sentences reading
> > > as follows:
> > >
> > >         Similarly, the same-CPU lock operation (a "round trip" pair
> > >         consisting of a lock acquisition and release) consumes more
> > >         than fifteen nanoseconds, or more than thirty clock cycles.
> > >         The lock operation is more expensive than CAS because it
> > >         requires two atomic operations on the lock data structure,
> > >         one for acquisition and the other for release.
> > >
> > > Does that help?
> > >
> > >                                                 Thanx, Paul
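
To make the distinction in the thread concrete, here is a minimal C11 sketch of a single CAS next to a lock "round trip". It is an illustration only: toy_spinlock, cas_once(), toy_lock(), toy_unlock(), and locked_increment() are hypothetical names invented for this example, not the Linux kernel's spin_lock()/spin_unlock() and not the pthread_mutex_lock() implementation. Which machine instructions the compiler emits for each call depends on the compiler and the architecture.

```c
#include <assert.h>     /* assert() is used only in the usage sketch below */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy spinlock, for illustration only. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* A single CAS: one atomic read-modify-write on *p.
 * Returns true if *p equaled 'expected' and was replaced by 'desired'. */
static bool cas_once(atomic_int *p, int expected, int desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}

/* Acquisition: first atomic operation on the lock word.
 * Spins until test-and-set reports that the flag was previously clear. */
static void toy_lock(toy_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the current holder clears the flag */
}

/* Release: second atomic operation on the same lock word. */
static void toy_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* One "lock operation" in the sense of the quoted table row:
 * an acquisition and a release around a small critical section. */
static int locked_increment(toy_spinlock *l, int *value)
{
    toy_lock(l);
    int result = ++*value;
    toy_unlock(l);
    return result;
}
```

A caller might declare `toy_spinlock l = TOY_SPINLOCK_INIT;` and an `int n = 0;`, then check `assert(locked_increment(&l, &n) == 1);`. Each such call touches the lock word twice, whereas cas_once() touches its target once, which is the asymmetry the quoted passage describes.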
